* Concerns about SFrame viability for userspace stack walking
@ 2025-10-30 6:53 Fangrui Song
2025-10-30 7:30 ` Jakub Jelinek
` (4 more replies)
0 siblings, 5 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-30 6:53 UTC (permalink / raw)
To: linux-toolchains, linux-perf-users, linux-kernel
I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
**Size overhead concerns**
Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.
This means adopting SFrame would result in carrying both formats, with a large net size increase.
**Learning from existing compact unwind implementations**
It's worth noting that LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6, which transitioned to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
__text section: 0x4a55470 bytes
__unwind_info section: 0x79060 bytes (0.6% of __text)
__eh_frame section: 0x58 bytes
(On macOS you can check the section size with objdump --arch x86_64 -h clang and dump the unwind info with objdump --arch x86_64 --unwind-info clang)
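As a quick check of the ratio quoted in the listing, the hexadecimal section sizes can be converted and compared. This is plain arithmetic on the numbers above; the variable names are only illustrative.

```python
# Section sizes copied from the listing above (hex byte counts).
text_size = 0x4a55470
unwind_info_size = 0x79060
eh_frame_size = 0x58

MIB = 1024 * 1024

# Compact unwind data as a percentage of the code size.
unwind_pct = 100 * unwind_info_size / text_size

print(f"__text:        {text_size / MIB:.2f} MiB")
print(f"__unwind_info: {unwind_info_size / MIB:.2f} MiB ({unwind_pct:.1f}% of __text)")
print(f"__eh_frame:    {eh_frame_size} bytes")
```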
OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their 2018 post: https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
The compact unwind format achieves this efficiency through a two-level page table structure. It describes common frame layouts compactly and falls back to DWARF only when necessary, allowing most DWARF CFI entries to be eliminated while maintaining full functionality. For more details, see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO implementation https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
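The two-level lookup described above can be sketched roughly as follows. This is a deliberately simplified model, not the actual Mach-O `__unwind_info` layout: the table contents, the encoding strings, and the name `lookup_encoding` are all made up for illustration.

```python
from bisect import bisect_right

# Hypothetical, simplified model of a two-level unwind index.
# First level: sorted start addresses of "pages" of functions.
# Second level: per page, sorted (function_start, encoding) pairs.
# A real format packs these as fixed-width records; this only shows the lookup.
FIRST_LEVEL = [0x1000, 0x2000, 0x3000]
SECOND_LEVEL = [
    [(0x1000, "frame-based"), (0x1040, "frameless, stack 16")],
    [(0x2000, "frame-based"), (0x2100, "fall back to DWARF")],
    [(0x3000, "frameless, stack 32")],
]

def lookup_encoding(pc):
    """Return the encoding of the function containing pc, or None."""
    # Level 1: find the last page whose start address is <= pc.
    page = bisect_right(FIRST_LEVEL, pc) - 1
    if page < 0:
        return None  # pc lies below the first covered address
    # Level 2: find the last function in that page starting at or before pc.
    entries = SECOND_LEVEL[page]
    starts = [start for start, _ in entries]
    idx = bisect_right(starts, pc) - 1
    if idx < 0:
        return None
    # Note: this toy model has no end addresses, so any pc past the last
    # function start maps to the last entry.
    return entries[idx][1]
```

Both levels are binary searches over sorted arrays, so a lookup needs no bytecode interpretation unless the entry says to fall back to DWARF.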
**The AArch64 case: size matters even more**
The size consideration becomes even more critical for AArch64, which is heavily deployed on mobile phones.
There's an active feature request for compact unwind support in the AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
This underscores the broader industry need for efficient unwind information that doesn't duplicate data or significantly increase binary size.
There are at least two existing formats that an ELF compact unwind format could learn from: LLVM's compact unwind format (aarch64) and the Windows ARM64 Frame Unwind Code.
**Path forward**
Unless SFrame can actually replace .eh_frame (rather than supplementing it as an accelerator for linux-perf) and demonstrate sizes smaller than .eh_frame - matching the efficiency of existing compact unwind approaches - I question its practical viability for userspace.
The current design appears to add overhead rather than reduce it.
This isn't to suggest we should simply adopt the existing compact unwind format wholesale.
The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.
For additional context, I've documented my detailed analysis at:
- https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering mandatory index building problems, section group compliance and garbage collection issues, and version compatibility challenges)
- https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs (size analysis)
Best regards,
Fangrui
[1]: https://github.com/llvm/llvm-project/commit/58e2d3d856b7dc7b97a18cfa2aeeb927bc7e6bd5 ("Generate compact unwind encoding from CFI directives.")
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
@ 2025-10-30 7:30 ` Jakub Jelinek
2025-10-30 7:50 ` Fangrui Song
2025-10-30 10:26 ` Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2025-10-30 7:30 UTC (permalink / raw)
To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel
On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
>
> **Size overhead concerns**
>
> Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
> This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.
I believe .sframe only provides a subset of the .eh_frame information, so can't be used for exception throwing, and you don't want to lose .eh_frame_hdr either because then dlopen becomes very costly and it will even slow down exception throwing.
If .eh_frame is considered too large, rather than inventing a new format I'd suggest to work in the DWARF committee and provide further size optimizations for .dwarf_frame which can then be used in .eh_frame, or agree on .eh_frame extensions to make it smaller.
Jakub
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 7:30 ` Jakub Jelinek
@ 2025-10-30 7:50 ` Fangrui Song
2025-10-30 8:05 ` Jakub Jelinek
0 siblings, 1 reply; 30+ messages in thread
From: Fangrui Song @ 2025-10-30 7:50 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel
On Thu, Oct 30, 2025 at 12:30 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> > I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
> >
> > **Size overhead concerns**
> >
> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
> > This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.
>
> I believe .sframe only provides a subset of the .eh_frame information, so
> can't be used for exception throwing, and you don't want to lose
> .eh_frame_hdr either because then dlopen becomes very costly and it will
> even slow down exception throwing.
Right.
> If .eh_frame is considered too large, rather than inventing a new format I'd
> suggest to work in the DWARF committee and provide further size
> optimizations for .dwarf_frame which can then be used in .eh_frame, or agree
> on .eh_frame extensions to make it smaller.
>
> Jakub
Thanks for the suggestion. An effective compact unwinding scheme needs to leverage ISA-specific properties. This architecture-specific nature makes it likely to fall outside DWARF's scope. That said, input from the DWARF committee would certainly be valuable.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 7:50 ` Fangrui Song
@ 2025-10-30 8:05 ` Jakub Jelinek
2025-10-31 2:51 ` Fangrui Song
0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2025-10-30 8:05 UTC (permalink / raw)
To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel
On Thu, Oct 30, 2025 at 12:50:42AM -0700, Fangrui Song wrote:
> An effective compact unwinding scheme needs to leverage ISA-specific properties.
Having 40-50 completely different unwinding schemes, one for each architecture or even ISA subset, would be a complete nightmare. Plus the important property of DWARF is that it is easily extensible. So, I think it would be better to invent new DWARF DW_CFA_* arch specific opcodes which would be a shorthand for the most common sequences of unwind info, or allow the CIEs to define a library of DW_CFA_* sets perhaps with parameters which would then be usable in the FDEs. There are already some arch specific opcodes, DW_CFA_GNU_window_save for SPARC and DW_CFA_AARCH64_negate_ra_state_with_pc/DW_CFA_AARCH64_negate_ra_state for AArch64, but if somebody took time to look through .eh_frame of many binaries/libraries on several different distributions for particular arch (so that there is no bias in what exact options those distros use etc.) and found something that keeps repeating there commonly that could be shortened, perhaps the assembler or linker could rewrite sequences of specific .cfi_* directives into something equivalent but shorter once the extension opcodes are added. Though, there are only very few opcodes left, so taking them should be done with great care and at least one should be left as a multiplexer (single byte opcode followed by uleb128 code for further operation + arguments).
Jakub
^ permalink raw reply [flat|nested] 30+ messages in thread
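The "multiplexer" idea in the message above (one single-byte opcode followed by a ULEB128 sub-opcode) can be sketched as below. ULEB128 decoding itself is standard; `MUX_OPCODE` and `decode_extended` are hypothetical names, and the opcode value is an arbitrary placeholder with no relation to real DW_CFA_* assignments.

```python
def decode_uleb128(data, pos=0):
    """Decode one ULEB128 value from data starting at pos.

    Returns (value, next_pos).
    """
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if byte & 0x80 == 0:               # high bit clear: last byte
            return result, pos
        shift += 7

# Arbitrary placeholder value for a multiplexer opcode (assumption).
MUX_OPCODE = 0x3E

def decode_extended(data, pos=0):
    """Decode 'MUX_OPCODE, uleb128 sub-opcode'; return (sub_opcode, next_pos)."""
    if data[pos] != MUX_OPCODE:
        raise ValueError("not a multiplexer opcode")
    return decode_uleb128(data, pos + 1)
```

Because the sub-opcode is variable-length, a single reserved byte opens up an effectively unbounded extension space, at the cost of at least one extra byte per extended operation.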
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 8:05 ` Jakub Jelinek
@ 2025-10-31 2:51 ` Fangrui Song
0 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-31 2:51 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel
On Thu, Oct 30, 2025 at 1:06 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Oct 30, 2025 at 12:50:42AM -0700, Fangrui Song wrote:
> > An effective compact unwinding scheme needs to leverage ISA-specific properties.
>
> Having 40-50 completely different unwinding schemes, one for each
> architecture or even ISA subset, would be a complete nightmare. Plus the
> important property of DWARF is that it is easily extensible. So, I think it
> would be better to invent new DWARF DW_CFA_* arch specific opcodes which
> would be a shorthand for the most common sequences of unwind info, or allow
> the CIEs to define a library of DW_CFA_* sets perhaps with parameters which
> would then be usable in the FDEs. There are already some arch specific
> opcodes, DW_CFA_GNU_window_save for SPARC and
> DW_CFA_AARCH64_negate_ra_state_with_pc/DW_CFA_AARCH64_negate_ra_state for
> AArch64, but if somebody took time to look through .eh_frame of many
> binaries/libraries on several different distributions for particular arch
> (so that there is no bias in what exact options those distros use etc.) and
> found something that keeps repeating there commonly that could be shortened,
> perhaps the assembler or linker could rewrite sequences of specific .cfi_*
> directives into something equivalent but shorter once the extension opcodes
> are added. Though, there are only very few opcodes left, so taking them
> should be done with great care and at least one should be left as a
> multiplexer (single byte opcode followed by uleb128 code for further
> operation + arguments).
>
> Jakub
That's a good point about being careful with new unwind formats.
The LLVM compact unwind format, used by Mach-O, utilizes an architecture-agnostic page table structure but has architecture-specific opcode formats (i386, x86-64, and aarch64). I.e. it does not introduce an entirely different format for each arch.
I believe the size issue with .eh_frame is primarily driven by the CIE/FDE overhead, not the CFI instructions. The inherently large size of a single FDE (around 20 bytes, see https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471/10?u=maskray ) is a significant contributor to overall size.
The performance issue of .eh_frame seems largely related to the byte code nature of the CFI instructions. Encoding locations with different CFI states explicitly as different frame entries makes unwinding faster.
^ permalink raw reply [flat|nested] 30+ messages in thread
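The last point in the message above, byte code versus explicit per-location entries, can be illustrated with a toy model. The operation names below are simplified stand-ins rather than real DWARF encodings, and all function names are hypothetical.

```python
from bisect import bisect_right

# Toy CFI-like program: ("advance", n) moves the location forward by n bytes,
# ("cfa_offset", k) sets the CFA offset from the stack pointer.
PROGRAM = [("cfa_offset", 8), ("advance", 1), ("cfa_offset", 16),
           ("advance", 3), ("cfa_offset", 48), ("advance", 40), ("cfa_offset", 8)]

def cfa_offset_by_interpreting(program, pc_offset):
    """Replay the program from the function start until pc_offset is passed."""
    loc, offset = 0, None
    for op, arg in program:
        if op == "advance":
            loc += arg
            if loc > pc_offset:
                break              # later operations apply beyond pc_offset
        else:
            offset = arg
    return offset

def build_rows(program):
    """Expand the program once into explicit (start_offset, cfa_offset) rows."""
    rows, loc = [], 0
    for op, arg in program:
        if op == "advance":
            loc += arg
        else:
            rows.append((loc, arg))
    return rows

def cfa_offset_by_rows(rows, pc_offset):
    """Binary-search precomputed rows; nothing is interpreted at lookup time.

    A real implementation would keep the sorted keys around instead of
    rebuilding the list on every call.
    """
    starts = [start for start, _ in rows]
    return rows[bisect_right(starts, pc_offset) - 1][1]
```

The interpreter has to walk the operations linearly for every query, while the row table trades some space for a search over fixed-size entries.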
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
2025-10-30 7:30 ` Jakub Jelinek
@ 2025-10-30 10:26 ` Peter Zijlstra
2025-10-30 16:48 ` Fangrui Song
2025-10-30 17:53 ` Andi Kleen
2025-10-30 14:47 ` Jose E. Marchesi
` (2 subsequent siblings)
4 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-10-30 10:26 UTC (permalink / raw)
To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel
On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> I've been following the SFrame discussion and wanted to share some
> concerns about its viability for userspace adoption, based on concrete
> measurements and comparison with existing compact unwind
> implementations in LLVM.
>
> **Size overhead concerns**
>
> Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
> approximately 10% larger than the combined size of .eh_frame and
> .eh_frame_hdr (8.06 MiB total). This is problematic because .eh_frame
> cannot be eliminated - it contains essential information for restoring
> callee-saved registers, LSDA, and personality information needed for
> debugging (e.g. reading local variables in a coredump) and C++
> exception handling.
>
> This means adopting SFrame would result in carrying both formats, with
> a large net size increase.
So the SFrame unwinder is fairly simple code, but what does an .eh_frame unwinder look like? Having read most of the links in your email, there seem to be references to DWARF byte code interpreters and stuff like that.
So while the format compactness is one aspect, the thing I find no mention of, is the unwinder complexity.
There have been a number of attempts to do DWARF unwinding in kernel space and while I think some architectures do it, x86_64 has had very bad experiences with it. At some point I think Linus just said no more, no DWARF, not ever.
So from a situation where compilers were generating bad CFI unwind information, a horribly complex unwinder that could crash the kernel harder than the thing it was reporting on and manual CFI annotations in assembly that were never quite right, objtool and ORC were born.
The win was many:
- simple robust unwinder
- no manual CFI annotations that could be wrong
- no reliance on compilers that would get it wrong
and I think this is where SFrame came from. I don't think the x86_64 Linux kernel will ever natively adopt SFrame, ORC works really well for us.
However, we do need something to unwind userspace. And yes, personally I'm in the frame-pointer camp, that's always worked well for me. Distro's however don't seem to like it much, which means that every time I do have to profile something userspace, I get to rebuild all the relevant code with framepointers on (which is not hard, but tedious).
Barring that, we need something for which the unwind code is simple and robust -- and I *think* this has disqualified .eh_frame and full on DWARF.
And this is again where SFrame comes in -- its unwinder is simple, something we can run in kernel space.
I really don't much care for the particulars, and frame pointers work for me -- but I do care about the kernel unwinder code. It had better be simple and robust.
So if you want us to use .eh_frame, great, show us a simple and robust unwinder.
^ permalink raw reply [flat|nested] 30+ messages in thread
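For readers wondering why frame pointers are described above as simple to walk, here is a minimal sketch over simulated stack memory. It assumes the usual x86-64 frame-pointer layout (saved frame pointer at [fp], return address at [fp + 8]); the addresses, the `MEMORY` dictionary, and `walk_frame_pointers` are invented for illustration and are not kernel code.

```python
# Simulated stack memory: address -> 8-byte word.
MEMORY = {
    0x7000: 0x7040, 0x7008: 0x401234,   # innermost frame
    0x7040: 0x7080, 0x7048: 0x401500,   # its caller
    0x7080: 0x0,    0x7088: 0x401900,   # outermost frame (saved fp == 0)
}

def walk_frame_pointers(memory, fp, max_depth=64):
    """Collect return addresses by following the saved frame pointer chain."""
    return_addresses = []
    for _ in range(max_depth):            # bound the walk for robustness
        if fp == 0 or fp not in memory or fp + 8 not in memory:
            break                         # end of chain or unreadable memory
        return_addresses.append(memory[fp + 8])
        next_fp = memory[fp]
        if next_fp != 0 and next_fp <= fp:
            break                         # stack grows down; chain must ascend
        fp = next_fp
    return return_addresses

print([hex(addr) for addr in walk_frame_pointers(MEMORY, 0x7000)])
```

The whole walker is a bounded loop with two memory reads per frame and a couple of sanity checks, which is the kind of simplicity the message asks any alternative format to match.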
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 10:26 ` Peter Zijlstra
@ 2025-10-30 16:48 ` Fangrui Song
2025-10-30 17:03 ` Jose E. Marchesi
` (2 more replies)
2025-10-30 17:53 ` Andi Kleen
1 sibling, 3 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-30 16:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel
On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> > I've been following the SFrame discussion and wanted to share some
> > concerns about its viability for userspace adoption, based on concrete
> > measurements and comparison with existing compact unwind
> > implementations in LLVM.
> >
> > **Size overhead concerns**
> >
> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
> > approximately 10% larger than the combined size of .eh_frame and
> > .eh_frame_hdr (8.06 MiB total). This is problematic because .eh_frame
> > cannot be eliminated - it contains essential information for restoring
> > callee-saved registers, LSDA, and personality information needed for
> > debugging (e.g. reading local variables in a coredump) and C++
> > exception handling.
> >
> > This means adopting SFrame would result in carrying both formats, with
> > a large net size increase.
>
> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
> unwinder look like? Having read most of the links in your email, there
> seem to be references to DWARF byte code interpreters and stuff like
> that.
>
> So while the format compactness is one aspect, the thing I find no
> mention of, is the unwinder complexity.
>
> There have been a number of attempts to do DWARF unwinding in
> kernel space and while I think some architectures do it, x86_64 has had
> very bad experiences with it. At some point I think Linus just said no
> more, no DWARF, not ever.
>
> So from a situation where compilers were generating bad CFI unwind
> information, a horribly complex unwinder that could crash the kernel
> harder than the thing it was reporting on and manual CFI annotations in
> assembly that were never quite right, objtool and ORC were born.
>
> The win was many:
>
> - simple robust unwinder
> - no manual CFI annotations that could be wrong
> - no reliance on compilers that would get it wrong
>
> and I think this is where SFrame came from. I don't think the x86_64
> Linux kernel will ever natively adopt SFrame, ORC works really well for
> us.
>
> However, we do need something to unwind userspace. And yes, personally
> I'm in the frame-pointer camp, that's always worked well for me.
> Distro's however don't seem to like it much, which means that every time
> I do have to profile something userspace, I get to rebuild all the
> relevant code with framepointers on (which is not hard, but tedious).
>
> Barring that, we need something for which the unwind code is simple and
> robust -- and I *think* this has disqualified .eh_frame and full on
> DWARF.
>
> And this is again where SFrame comes in -- its unwinder is simple,
> something we can run in kernel space.
>
> I really don't much care for the particulars, and frame pointers work
> for me -- but I do care about the kernel unwinder code. It had better be
> simple and robust.
>
> So if you want us to use .eh_frame, great, show us a simple and robust
> unwinder.
Hi Peter,
Thanks for this perspective - the unwinder complexity concern is absolutely valid and critical for kernel use.
To clarify my motivation: I've seen attempts to use SFrame for userspace adoption (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I believe it's not viable for that purpose given the size overhead I documented. My concerns are primarily about userspace adoption, not the kernel's internal unwinding.
If SFrame is exclusively a kernel-space feature, it could be implemented entirely within objtool – similar to how objtool --link --orc generates ORC info for vmlinux.o. This approach would eliminate the need for any modifications to assemblers and linkers, while allowing SFrame to evolve in any incompatible way.
For userspace, we could instead modify assemblers and linkers to support a more compact format or an extension to .eh_frame, but it won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception handling, while SFrame can't, leading to a huge missed opportunity.)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 16:48 ` Fangrui Song
@ 2025-10-30 17:03 ` Jose E. Marchesi
2025-10-31 4:22 ` Fangrui Song
2025-10-30 17:33 ` Steven Rostedt
2025-10-30 18:22 ` Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Jose E. Marchesi @ 2025-10-30 17:03 UTC (permalink / raw)
To: Fangrui Song
Cc: Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel
> On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
>> > I've been following the SFrame discussion and wanted to share some
>> > concerns about its viability for userspace adoption, based on concrete
>> > measurements and comparison with existing compact unwind
>> > implementations in LLVM.
>> >
>> > **Size overhead concerns**
>> >
>> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
>> > approximately 10% larger than the combined size of .eh_frame and
>> > .eh_frame_hdr (8.06 MiB total). This is problematic because .eh_frame
>> > cannot be eliminated - it contains essential information for restoring
>> > callee-saved registers, LSDA, and personality information needed for
>> > debugging (e.g. reading local variables in a coredump) and C++
>> > exception handling.
>> >
>> > This means adopting SFrame would result in carrying both formats, with
>> > a large net size increase.
>>
>> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
>> unwinder look like? Having read most of the links in your email, there
>> seem to be references to DWARF byte code interpreters and stuff like
>> that.
>>
>> So while the format compactness is one aspect, the thing I find no
>> mention of, is the unwinder complexity.
>>
>> There have been a number of attempts to do DWARF unwinding in
>> kernel space and while I think some architectures do it, x86_64 has had
>> very bad experiences with it. At some point I think Linus just said no
>> more, no DWARF, not ever.
>>
>> So from a situation where compilers were generating bad CFI unwind
>> information, a horribly complex unwinder that could crash the kernel
>> harder than the thing it was reporting on and manual CFI annotations in
>> assembly that were never quite right, objtool and ORC were born.
>>
>> The win was many:
>>
>> - simple robust unwinder
>> - no manual CFI annotations that could be wrong
>> - no reliance on compilers that would get it wrong
>>
>> and I think this is where SFrame came from. I don't think the x86_64
>> Linux kernel will ever natively adopt SFrame, ORC works really well for
>> us.
>>
>> However, we do need something to unwind userspace. And yes, personally
>> I'm in the frame-pointer camp, that's always worked well for me.
>> Distro's however don't seem to like it much, which means that every time
>> I do have to profile something userspace, I get to rebuild all the
>> relevant code with framepointers on (which is not hard, but tedious).
>>
>> Barring that, we need something for which the unwind code is simple and
>> robust -- and I *think* this has disqualified .eh_frame and full on
>> DWARF.
>>
>> And this is again where SFrame comes in -- its unwinder is simple,
>> something we can run in kernel space.
>>
>> I really don't much care for the particulars, and frame pointers work
>> for me -- but I do care about the kernel unwinder code. It had better be
>> simple and robust.
>>
>> So if you want us to use .eh_frame, great, show us a simple and robust
>> unwinder.
>
> Hi Peter,
>
> Thanks for this perspective - the unwinder complexity concern is
> absolutely valid and critical for kernel use.
> To clarify my motivation: I've seen attempts to use SFrame for
> userspace adoption
> (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
> believe it's not viable for that purpose given the size overhead I
> documented. My concerns are primarily about userspace adoption, not
> the kernel's internal unwinding.
>
> If SFrame is exclusively a kernel-space feature, it could be
> implemented entirely within objtool – similar to how objtool --link
> --orc generates ORC info for vmlinux.o. This approach would eliminate
> the need for any modifications to assemblers and linkers, while
> allowing SFrame to evolve in any incompatible way.
>
> For userspace, we could instead modify assemblers and linkers to
> support a more compact format or an extension to .eh_frame, but it
> won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s
> exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception
> handling, while SFrame can't, leading to a huge missed opportunity.)
The purpose of SFrame is not to be a more compact replacement for .eh_frame. It is intended to be used to walk stacks, not to unwind them.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 17:03 ` Jose E. Marchesi
@ 2025-10-31 4:22 ` Fangrui Song
2025-10-31 14:37 ` Jose E. Marchesi
0 siblings, 1 reply; 30+ messages in thread
From: Fangrui Song @ 2025-10-31 4:22 UTC (permalink / raw)
To: Jose E. Marchesi
Cc: Fangrui Song, Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel
On Thu, Oct 30, 2025 at 10:04 AM Jose E. Marchesi <jose.marchesi@oracle.com> wrote:
>
>
> > On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> >> > I've been following the SFrame discussion and wanted to share some
> >> > concerns about its viability for userspace adoption, based on concrete
> >> > measurements and comparison with existing compact unwind
> >> > implementations in LLVM.
> >> >
> >> > **Size overhead concerns**
> >> >
> >> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
> >> > approximately 10% larger than the combined size of .eh_frame and
> >> > .eh_frame_hdr (8.06 MiB total). This is problematic because .eh_frame
> >> > cannot be eliminated - it contains essential information for restoring
> >> > callee-saved registers, LSDA, and personality information needed for
> >> > debugging (e.g. reading local variables in a coredump) and C++
> >> > exception handling.
> >> >
> >> > This means adopting SFrame would result in carrying both formats, with
> >> > a large net size increase.
> >>
> >> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
> >> unwinder look like? Having read most of the links in your email, there
> >> seem to be references to DWARF byte code interpreters and stuff like
> >> that.
> >>
> >> So while the format compactness is one aspect, the thing I find no
> >> mention of, is the unwinder complexity.
> >>
> >> There have been a number of attempts to do DWARF unwinding in
> >> kernel space and while I think some architectures do it, x86_64 has had
> >> very bad experiences with it. At some point I think Linus just said no
> >> more, no DWARF, not ever.
> >>
> >> So from a situation where compilers were generating bad CFI unwind
> >> information, a horribly complex unwinder that could crash the kernel
> >> harder than the thing it was reporting on and manual CFI annotations in
> >> assembly that were never quite right, objtool and ORC were born.
> >>
> >> The win was many:
> >>
> >> - simple robust unwinder
> >> - no manual CFI annotations that could be wrong
> >> - no reliance on compilers that would get it wrong
> >>
> >> and I think this is where SFrame came from. I don't think the x86_64
> >> Linux kernel will ever natively adopt SFrame, ORC works really well for
> >> us.
> >>
> >> However, we do need something to unwind userspace. And yes, personally
> >> I'm in the frame-pointer camp, that's always worked well for me.
> >> Distro's however don't seem to like it much, which means that every time
> >> I do have to profile something userspace, I get to rebuild all the
> >> relevant code with framepointers on (which is not hard, but tedious).
> >>
> >> Barring that, we need something for which the unwind code is simple and
> >> robust -- and I *think* this has disqualified .eh_frame and full on
> >> DWARF.
> >>
> >> And this is again where SFrame comes in -- its unwinder is simple,
> >> something we can run in kernel space.
> >>
> >> I really don't much care for the particulars, and frame pointers work
> >> for me -- but I do care about the kernel unwinder code. It had better be
> >> simple and robust.
> >>
> >> So if you want us to use .eh_frame, great, show us a simple and robust
> >> unwinder.
> >
> > Hi Peter,
> >
> > Thanks for this perspective - the unwinder complexity concern is
> > absolutely valid and critical for kernel use.
> > To clarify my motivation: I've seen attempts to use SFrame for
> > userspace adoption
> > (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
> > believe it's not viable for that purpose given the size overhead I
> > documented. My concerns are primarily about userspace adoption, not
> > the kernel's internal unwinding.
> >
> > If SFrame is exclusively a kernel-space feature, it could be
> > implemented entirely within objtool – similar to how objtool --link
> > --orc generates ORC info for vmlinux.o. This approach would eliminate
> > the need for any modifications to assemblers and linkers, while
> > allowing SFrame to evolve in any incompatible way.
> >
> > For userspace, we could instead modify assemblers and linkers to
> > support a more compact format or an extension to .eh_frame, but it
> > won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s
> > exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception
> > handling, while SFrame can't, leading to a huge missed opportunity.)
>
> The purpose of SFrame is not to be a more compact replacement for
> .eh_frame. It is intended to be used to walk stacks, not to unwind
> them.
Hi Jose,
Let me clarify my concerns, as I think we may be talking past each other a bit.
**The primary concern: size overhead for userspace**
The fundamental issue is that SFrame, as currently designed, results in a significant net size increase for userspace binaries because it is large and cannot replace .eh_frame (which would mean losing debugging and C++ exception handling support).
The median .eh_frame size across executables and shared libraries on a Linux system is 5+% of total VM size: https://gist.github.com/MaskRay/5995d10b65e1e18b82931c5a8d97f55e
Increasing this to 10% by adding SFrame on top is simply not viable.
**What about kernel use?**
As I mentioned in my reply to Peter, if SFrame is exclusively a kernel-space feature, it could be implemented entirely within objtool - similar to how objtool --link --orc generates ORC info for vmlinux.o. I believe SFrame has a size advantage over ORC, which could make it attractive for this use case.
However, if SFrame will not replace the existing in-kernel ORC unwinder (as Peter suggested), then I'm afraid SFrame doesn't have a clear position - neither for vmlinux nor for userspace programs.
**On the ELF format issues**
The current Binutils implementation disregards ELF and linker conventions, which is a serious concern for all linker maintainers. The proposed SHF_OS_NONCONFORMING_DISCARD flag has faced strong objections in the generic ABI discussion: https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8
There are also unresolved garbage collection issues. I had to disable -Wl,--gc-sections entirely when testing for https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
I want to emphasize: custom merging rules do not inherently conflict with using proper multi-section structure with section group and SHF_LINK_ORDER. The format could be designed to work within established ELF conventions rather than requiring special cases throughout the linker.
The concern about maintenance burden isn't about the initial implementation - it's about committing to long-term support for a format that requires custom handling in every linker while providing questionable benefit for its stated use case.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking 2025-10-31 4:22 ` Fangrui Song @ 2025-10-31 14:37 ` Jose E. Marchesi 0 siblings, 0 replies; 30+ messages in thread From: Jose E. Marchesi @ 2025-10-31 14:37 UTC (permalink / raw) To: Fangrui Song Cc: Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel > On Thu, Oct 30, 2025 at 10:04 AM Jose E. Marchesi > <jose.marchesi@oracle.com> wrote: >> >> >> > On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote: >> >> >> >> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote: >> >> > I've been following the SFrame discussion and wanted to share some >> >> > concerns about its viability for userspace adoption, based on concrete >> >> > measurements and comparison with existing compact unwind >> >> > implementations in LLVM. >> >> > >> >> > **Size overhead concerns** >> >> > >> >> > Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is >> >> > approximately 10% larger than the combined size of .eh_frame and >> >> > .eh_frame_hdr (8.06 MiB total). This is problematic because .eh_frame >> >> > cannot be eliminated - it contains essential information for restoring >> >> > callee-saved registers, LSDA, and personality information needed for >> >> > debugging (e.g. reading local variables in a coredump) and C++ >> >> > exception handling. >> >> > >> >> > This means adopting SFrame would result in carrying both formats, with >> >> > a large net size increase. >> >> >> >> So the SFrame unwinder is fairly simple code, but what does an .eh_frame >> >> unwinder look like? Having read most of the links in your email, there >> >> seem to be references to DWARF byte code interpreters and stuff like >> >> that. >> >> >> >> So while the format compactness is one aspect, the thing I find no >> >> mention of, is the unwinder complexity. 
>> >> >> >> There have been a number of attempts to do DWARF unwinding in >> >> kernel space and while I think some architecture do it, x86_64 has had >> >> very bad experiences with it. At some point I think Linus just said no >> >> more, no DWARF, not ever. >> >> >> >> So from a situation where compilers were generating bad CFI unwind >> >> information, a horribly complex unwinder that could crash the kernel >> >> harder than the thing it was reporting on and manual CFI annotations in >> >> assembly that were never quite right, objtool and ORC were born. >> >> >> >> The win was many: >> >> >> >> - simple robust unwinder >> >> - no manual CFI annotations that could be wrong >> >> - no reliance on compilers that would get it wrong >> >> >> >> and I think this is where SFrame came from. I don't think the x86_64 >> >> Linux kernel will ever natively adopt SFrame, ORC works really well for >> >> us. >> >> >> >> However, we do need something to unwind userspace. And yes, personally >> >> I'm in the frame-pointer camp, that's always worked well for me. >> >> Distro's however don't seem to like it much, which means that every time >> >> I do have to profile something userspace, I get to rebuild all the >> >> relevant code with framepointers on (which is not hard, but tedious). >> >> >> >> Barring that, we need something for which the unwind code is simple and >> >> robust -- and I *think* this has disqualified .eh_frame and full on >> >> DWARF. >> >> >> >> And this is again where SFrame comes in -- its unwinder is simple, >> >> something we can run in kernel space. >> >> >> >> I really don't much care for the particulars, and frame pointers work >> >> for me -- but I do care about the kernel unwinder code. It had better be >> >> simple and robvst. >> >> >> >> So if you want us to use .eh_frame, great, show us a simple and robust >> >> unwinder. 
>> > >> > Hi Peter, >> > >> > Thanks for this perspective—the unwinder complexity concern is >> > absolutely valid and critical for kernel use. >> > To clarify my motivation: I've seen attempts to use SFrame for >> > userspace adoption >> > (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I >> > believe it's not viable for that purpose given the size overhead I >> > documented. My concerns are primarily about userspace adoption, not >> > the kernel's internal unwinding. >> > >> > If SFrame is exclusively a kernel-space feature, it could be >> > implemented entirely within objtool – similar to how objtool --link >> > --orc generates ORC info for vmlinux.o. This approach would eliminate >> > the need for any modifications to assemblers and linkers, while >> > allowing SFrame to evolve in any incompatible way. >> > >> > For userspace, we could instead modify assemblers and linkers to >> > support a more compact format or an extension to .eh_frame , but it >> > won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s >> > exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception >> > handling , while SFrame can't, leading to a huge missed opportunity.) >> >> The purpose of SFrame is not to be a more compact replacement for >> .eh_frame. It is intended to be used to walk stacks, not to unwind >> them. > > Hi Jose, > > Let me clarify my concerns, as I think we may be talking past each > other a bit. 
Indeed, and thanks for following up :)

> **The primary concern: size overhead for userspace**
>
> The fundamental issue is that SFrame, as currently designed, results
> in a significant net size increase for userspace binaries because it
> is large and cannot replace .eh_frame (which would mean losing
> debugging and C++ exception handling support). The median .eh_frame
> size across executables and shared libraries on a Linux system is 5+%
> of total VM size:
>
> https://gist.github.com/MaskRay/5995d10b65e1e18b82931c5a8d97f55e
>
> Increasing this to 10% by adding SFrame on top is simply not viable.
> As my reply to Peter mentioned, "If SFrame is exclusively a
> kernel-space feature, it could be implemented entirely within
> objtool, similar to how objtool --link --orc generates ORC info for
> vmlinux.o."

I understand your concern, but whether the size overhead introduced by SFrame is "viable" or not is, I would say, up to the users to decide, not us tools engineers. If someone wants to trade a 5% increase in size (or whatever amount, really) for improved traceability and/or performance, we are not going to convince them otherwise, especially if we cannot provide a working alternative that would give them a better tradeoff.

> **What about kernel use?**
>
> As I mentioned in my reply to Peter, if SFrame is exclusively a
> kernel-space feature, it could be implemented entirely within
> objtool, similar to how objtool --link --orc generates ORC info for
> vmlinux.o.
> I believe SFrame has a size advantage over ORC, which could make it
> attractive for this use case.
>
> However, if SFrame will not replace the existing in-kernel ORC
> unwinder (as Peter suggested), then I'm afraid SFrame doesn't have a
> clear position, neither for vmlinux nor for userspace programs.
>
> **On the ELF format issues**
>
> https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8
>
> The current Binutils implementation disregards ELF and linker
> conventions, which is a serious concern for all linker maintainers.

Sorry, but binutils doesn't disregard anything it wasn't disregarding before implementing SFrame, and certainly nothing that lld doesn't currently disregard as well, in the sense that both linkers support other formats that require linker awareness for meaningful merging.

> The proposed SHF_OS_NONCONFORMING_DISCARD flag has faced strong
> objections in the generic ABI discussion:
> https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8

I would be disappointed otherwise: it is their job to resist change, as it is the job of everyone else to push for it whenever they feel it is necessary. Don't get disheartened: it is just a single flag that is being proposed, one that doesn't involve any sort of elaborate semantics and that is a clear logical complement to an existing flag.

So I remain optimistic, but given there are only a handful of ELF linkers around, IMO this flag proposal is more hygienic in nature than anything else and its absence is hardly a showstopper. It is clearly better to have "if (section_is_unknown && nonconforming_discard) { discard_input_section }" than "if (sframe && i_dont_support_sframe) { discard_input_section }" in a few places, but if it turns out we can't have it, well... it isn't the end of the world.

> There are also unresolved garbage collection issues. I had to disable
> -Wl,--gc-sections entirely when testing for
> https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> I want to emphasize: custom merging rules do not inherently conflict
> with using proper multi-section structure with section group and
> SHF_LINK_ORDER.

As I already pointed out in my previous reply, I think a solution was found for that and it is being worked out.

> The format could be designed to work within established ELF
> conventions rather than requiring special cases throughout the linker.

Would you _please_ consider helping them to do so? I believe there is still time to get changes into V3, so if you have suggestions for improving SFrame in that regard, other than offloading complexity to clients or post-processing tools, or throwing the whole thing out with the bath water, by all means please reach out to them.

> The concern about maintenance burden isn't about the initial
> implementation; it's about committing to long-term support for a format
> that requires custom handling in every linker while providing
> questionable benefit for its stated use case.

Thanks for explaining. So you are saying that the question of whether to include SFrame support in lld boils down to the questionable benefit of SFrame for its stated use case, the primary concern there being the size overhead. Yes?

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 16:48 ` Fangrui Song
2025-10-30 17:03 ` Jose E. Marchesi
@ 2025-10-30 17:33 ` Steven Rostedt
2025-10-31 5:28 ` Fangrui Song
2025-10-30 18:22 ` Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2025-10-30 17:33 UTC (permalink / raw)
To: Fangrui Song
Cc: Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel

On Thu, 30 Oct 2025 09:48:50 -0700 Fangrui Song <maskray@sourceware.org> wrote:

> If SFrame is exclusively a kernel-space feature, it could be
> implemented entirely within objtool – similar to how objtool --link
> --orc generates ORC info for vmlinux.o. This approach would eliminate
> the need for any modifications to assemblers and linkers, while
> allowing SFrame to evolve in any incompatible way.

I'm not sure what you mean here. Yes, it is implemented in the kernel, but it is reading user space applications to get the sframes from them.

Every running application would need this information for its executable. The kernel is dependent on user space having this.

The only thing the kernel is doing is reading the sframe tables associated with the running applications to be able to walk their stacks at runtime to do profiling. As Peter said, the kernel cares a great deal about that walking being simple. If something goes wrong, you compromise the entire machine.

-- Steve

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking 2025-10-30 17:33 ` Steven Rostedt @ 2025-10-31 5:28 ` Fangrui Song 0 siblings, 0 replies; 30+ messages in thread From: Fangrui Song @ 2025-10-31 5:28 UTC (permalink / raw) To: Steven Rostedt, Peter Zijlstra Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel On Thu, Oct 30, 2025 at 10:32 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, 30 Oct 2025 09:48:50 -0700 > Fangrui Song <maskray@sourceware.org> wrote: > > > If SFrame is exclusively a kernel-space feature, it could be > > implemented entirely within objtool – similar to how objtool --link > > --orc generates ORC info for vmlinux.o. This approach would eliminate > > the need for any modifications to assemblers and linkers, while > > allowing SFrame to evolve in any incompatible way. > > I'm not sure what you mean here. Yes, it is implemented in the kernel, but > it is reading user space applications to get the sframes from them. > > Every running application would need this information for its executable. > The kernel is dependent on user space having this. > > The only thing the kernel is doing is reading the sframe tables associated > with the running applications to be able to walk their stacks at runtime to > do profiling. As Peter asked, the kernel cares extensively on that walking > being simple. If something goes wrong, you compromise the entire machine. > > -- Steve I suspect your concern is primarily with DWARF expressions (the DW_OP_* opcodes), which are needed for complex, unusual frames. If the perf subsystem ignores those and focuses only on standard frame layouts, ensuring safety becomes much more straightforward. Compact unwinding formats are designed with this principle in mind—they don't use bytecode CFI instructions or DWARF expressions at all. 
Instead, they use a binary search table (similar to .eh_frame_hdr) to locate the frame descriptor, then decode it using a straightforward nested switch statement based on the compact encoding. For reference, this is llvm-project/libunwind's implementation for i386, x86-64, and aarch64: https://github.com/llvm/llvm-project/blob/main/libunwind/src/CompactUnwinder.hpp ^ permalink raw reply [flat|nested] 30+ messages in thread
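To make the "binary search table, then switch on the encoding" flow described above concrete, here is a minimal sketch in C. It is only an illustration under stated assumptions: the 32-bit descriptor layout (mode in bits 24..27, operand in bits 0..23), the mode values, and the names `unwind_entry`, `find_entry`, `cfa_rule` and `decode_rule` are all made up for this example. They are not Apple's actual compact unwind encoding and not the libunwind code linked above, and the real decoder nests a second switch inside some modes where this sketch uses a single one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL descriptor layout, for illustration only:
 * bits 24..27 select a mode, bits 0..23 are a mode-specific operand. */
enum { MODE_FRAME_POINTER = 1, MODE_FRAMELESS = 2, MODE_DWARF = 3 };

struct unwind_entry {
    uint32_t func_start; /* function start, as an offset from the image base */
    uint32_t encoding;   /* compact descriptor for the whole function */
};

enum cfa_base { CFA_FROM_FP, CFA_FROM_SP };

struct cfa_rule {
    enum cfa_base base; /* register the CFA is computed from */
    uint32_t offset;    /* byte offset added to that register */
};

/* Binary search a table sorted by func_start for the last entry whose
 * start is <= pc. Returns NULL when pc precedes every entry. */
const struct unwind_entry *find_entry(const struct unwind_entry *table,
                                      size_t count, uint32_t pc)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].func_start <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? &table[lo - 1] : NULL;
}

/* Decode a descriptor with a plain switch: no bytecode, no expression
 * stack. Returns false when the entry defers to DWARF (or is unknown),
 * which a kernel-side walker could treat as "stop walking here". */
bool decode_rule(uint32_t encoding, struct cfa_rule *out)
{
    switch ((encoding >> 24) & 0xfu) {
    case MODE_FRAME_POINTER:
        out->base = CFA_FROM_FP;
        out->offset = 16; /* saved frame pointer + return address on x86-64 */
        return true;
    case MODE_FRAMELESS:
        out->base = CFA_FROM_SP;
        out->offset = (encoding & 0xffffffu) * 8; /* operand counts 8-byte slots */
        return true;
    default:
        return false;
    }
}
```

A walker would call `find_entry()` with the PC (relative to the image base) and then `decode_rule()` on the entry it gets back; every step is bounded and allocation-free, which is the property the kernel side of this thread is asking for.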
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 16:48 ` Fangrui Song
2025-10-30 17:03 ` Jose E. Marchesi
2025-10-30 17:33 ` Steven Rostedt
@ 2025-10-30 18:22 ` Peter Zijlstra
2 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-10-30 18:22 UTC (permalink / raw)
To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 09:48:50AM -0700, Fangrui Song wrote:

> Thanks for this perspective - the unwinder complexity concern is
> absolutely valid and critical for kernel use.
> To clarify my motivation: I've seen attempts to use SFrame for
> userspace adoption
> (https://fedoraproject.org/wiki/Changes/SFrameInBinaries), and I
> believe it's not viable for that purpose given the size overhead I
> documented. My concerns are primarily about userspace adoption, not
> the kernel's internal unwinding.
>
> If SFrame is exclusively a kernel-space feature, it could be
> implemented entirely within objtool - similar to how objtool --link
> --orc generates ORC info for vmlinux.o. This approach would eliminate
> the need for any modifications to assemblers and linkers, while
> allowing SFrame to evolve in any incompatible way.
>
> For userspace, we could instead modify assemblers and linkers to
> support a more compact format or an extension to .eh_frame, but it
> won't be SFrame (all of Apple's compact unwind, ARM EHABI's
> exidx/extab, and Microsoft's pdata/xdata can implement C++ exception
> handling, while SFrame can't, leading to a huge missed opportunity.)

No, you misunderstand. The x86_64 Linux kernel is using ORC internally and we're happy with that.

However, the kernel also needs to be able to unwind/walk user stack frames. We need a simple, robust means of walking user space stacks from the kernel.

It is here that SFrame is proposed on x86_64. The kernel consumes user space SFrame data to unwind user space stacks.
This is also why the SFrame sections are SHF_ALLOC, such that the kernel can simply fault them in on-demand without having to otherwise initiate IO. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking 2025-10-30 10:26 ` Peter Zijlstra 2025-10-30 16:48 ` Fangrui Song @ 2025-10-30 17:53 ` Andi Kleen 2025-10-30 18:07 ` Mark Brown 1 sibling, 1 reply; 30+ messages in thread From: Andi Kleen @ 2025-10-30 17:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel Peter Zijlstra <peterz@infradead.org> writes: > > So the SFrame unwinder is fairly simple code, but what does an .eh_frame > unwinder look like? Having read most of the links in your email, there > seem to be references to DWARF byte code interpreters and stuff like > that. Here's Jan Beulich's Linux implementation. The x86 version was removed, but it lives on for ARC: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arc/kernel/unwind.c SH also has another one from Matt Flemming: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/sh/kernel/dwarf.c IMNSHO the whole sframe effort is misguided because all the major ISAs do have shadow stack hardware support now which is generally a better option. It would be better to invest effort in deploying that widely. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking 2025-10-30 17:53 ` Andi Kleen @ 2025-10-30 18:07 ` Mark Brown 2025-10-30 18:31 ` Andi Kleen 0 siblings, 1 reply; 30+ messages in thread From: Mark Brown @ 2025-10-30 18:07 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel [-- Attachment #1: Type: text/plain, Size: 374 bytes --] On Thu, Oct 30, 2025 at 10:53:13AM -0700, Andi Kleen wrote: > IMNSHO the whole sframe effort is misguided because all the major ISAs do have > shadow stack hardware support now which is generally a better option. > It would be better to invest effort in deploying that widely. It's going to take a *considerable* time for the hardware support to become standard. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking 2025-10-30 18:07 ` Mark Brown @ 2025-10-30 18:31 ` Andi Kleen 2025-10-30 18:45 ` Mark Brown 2025-10-30 18:57 ` Peter Zijlstra 0 siblings, 2 replies; 30+ messages in thread From: Andi Kleen @ 2025-10-30 18:31 UTC (permalink / raw) To: Mark Brown Cc: Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel On Thu, Oct 30, 2025 at 06:07:49PM +0000, Mark Brown wrote: > On Thu, Oct 30, 2025 at 10:53:13AM -0700, Andi Kleen wrote: > > > IMNSHO the whole sframe effort is misguided because all the major ISAs do have > > shadow stack hardware support now which is generally a better option. > > It would be better to invest effort in deploying that widely. > > It's going to take a *considerable* time for the hardware support to > become standard. Optimizing for the past instead of the future? Not on x86 at least. All my x86 systems have it, except for a few old skylakes. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking 2025-10-30 18:31 ` Andi Kleen @ 2025-10-30 18:45 ` Mark Brown 2025-10-31 8:24 ` Fangrui Song 2025-10-30 18:57 ` Peter Zijlstra 1 sibling, 1 reply; 30+ messages in thread From: Mark Brown @ 2025-10-30 18:45 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel [-- Attachment #1: Type: text/plain, Size: 743 bytes --] On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote: > On Thu, Oct 30, 2025 at 06:07:49PM +0000, Mark Brown wrote: > > It's going to take a *considerable* time for the hardware support to > > become standard. > Optimizing for the past instead of the future? On arm64 no currently available hardware has shadow stack support, and once systems start becoming available it'll take a very long time for that to filter down to even being all newly shipping systems, let alone all systems that people care about running new software on. > Not on x86 at least. All my x86 systems have it, except for a few old > skylakes. My experience trying to find a system to test changes on was somewhat different :( I did eventually get something. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 18:45 ` Mark Brown
@ 2025-10-31 8:24 ` Fangrui Song
0 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-31 8:24 UTC (permalink / raw)
To: Mark Brown
Cc: Andi Kleen, Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 11:45 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:
> > On Thu, Oct 30, 2025 at 06:07:49PM +0000, Mark Brown wrote:
>
> > > It's going to take a *considerable* time for the hardware support to
> > > become standard.
>
> > Optimizing for the past instead of the future?
>
> On arm64 no currently available hardware has shadow stack support, and
> once systems start becoming available it'll take a very long time for
> that to filter down to even being all newly shipping systems, let alone
> all systems that people care about running new software on.
>
> > Not on x86 at least. All my x86 systems have it, except for a few old
> > skylakes.
>
> My experience trying to find a system to test changes on was somewhat
> different :( I did eventually get something.

I’ve chatted with mobile toolchain developers at the LLVM Developers' Meeting, who emphasized that size concerns are especially critical for AArch64, which is heavily deployed on mobile phones.
I think Arm ABI makers are unlikely to want a format known not to work with mobile Linux to coexist with a future, more widely adopted compact format with callee-saved register, LSDA, and personality support.
I chatted with Peter Smith, who seems to think so, but I don't want to put words in his mouth :)

---

Intel’s 11th Gen and AMD Zen 3 support hardware shadow stack.
A software-only stack walking approach that doesn’t replace .eh_frame (and that remains unvetted for AArch64, see above) would quickly become obsolete.
Shadow stack can be enabled per process, providing flexibility to balance performance overhead / memory consumption with profiling needs, even for users who don’t prioritize the security hardening aspect. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 18:31 ` Andi Kleen
2025-10-30 18:45 ` Mark Brown
@ 2025-10-30 18:57 ` Peter Zijlstra
2025-10-31 11:46 ` Mark Brown
1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-10-30 18:57 UTC (permalink / raw)
To: Andi Kleen
Cc: Mark Brown, Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:

> Not on x86 at least. All my x86 systems have it, except for a few old
> skylakes.

About half of my systems have CET, but I just checked, none of them seem to actually use userspace shadow stacks.

AFAICT Debian hasn't built their packages with this stuff on.

But yeah, thanks for reminding me, we should definitely build a shstk unwinder.

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 18:57 ` Peter Zijlstra
@ 2025-10-31 11:46 ` Mark Brown
0 siblings, 0 replies; 30+ messages in thread
From: Mark Brown @ 2025-10-31 11:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 534 bytes --]

On Thu, Oct 30, 2025 at 07:57:25PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:

> > Not on x86 at least. All my x86 systems have it, except for a few old
> > skylakes.

> About half of my systems have CET, but I just checked, none of them
> seem to actually use userspace shadow stacks.

> AFAICT Debian hasn't built their packages with this stuff on.

It hasn't - they're starting to roll it out now in the development distro, an actual release is a couple of years away at this point.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
2025-10-30 7:30 ` Jakub Jelinek
2025-10-30 10:26 ` Peter Zijlstra
@ 2025-10-30 14:47 ` Jose E. Marchesi
2025-11-04 9:21 ` Indu
2025-12-01 9:04 ` Fangrui Song
4 siblings, 0 replies; 30+ messages in thread
From: Jose E. Marchesi @ 2025-10-30 14:47 UTC (permalink / raw)
To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

Hi Fangrui.

> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> mandatory index building problems, section group compliance and
> garbage collection issues, and version compatibility challenges)

After reading your blog it seems to me that your main concern is (was?) that SFrame "violates ELF rules" because it is not amenable to concatenation (the result of concatenating two "sframes" is not a valid sframe) and thus it requires specific linker support to merge these sections. This support, once in place, will have to be maintained moving forward, and evolved along with the format.

First, SFrame is not concatenable because its main design goal is to be simple to _use_ (not necessarily trivial to link) so it provides little luxuries like a fixed-size header (uoh), it is self-contained, it does not require an explicit index to be searched efficiently (instead you just binary search on the data in place), it has no run-time relocations, etc. You don't need to allocate memory dynamically to decode SFrame and walk the stack with it, and it has even been proved (by the parca project people) that it is possible to write actual verifiable BPF to walk a userspace stack using an internal format that is in essence identical to SFrame. You may not care about any of that, but the people wanting to use the format certainly do, and that's the reason SFrame (and ORC for that matter) is the way it is.
Sure, nobody would object to making SFrame concatenable to make your (and mine, incidentally) life easier, but not at the cost of burdening users with extra complexity that they don't need and are not willing to assume: why would they? We are not even quite sure if such a thing is achievable in this case: you either put the complexity in the linker, or on the users, but you cannot make it magically disappear. If you have some _concrete_ suggestion on improving the format, please by all means let the SFrame people know, or consider following up in threads like [1] where these details are being discussed. That would be helpful indeed.

Second, as bizarre as it may be, having non-concatenable data in an ELF section only "violates ELF rules" if the linker _doesn't know_ about the type of the section containing it. So the solution is obvious: make your linker aware of SFrame sections and, voilà, the ELF violation goes away. You can either merge the SFrame data, or just discard the input sections, or call the Linker Police. Just do _something_ about it, because doing nothing leads to emitting nonsense, and that pisses off everyone.

Third, this "problem" is not exclusive to SFrame. Other formats like EH Frame also require some degree of linker awareness, in the form of the generation of an explicit index, or merging, or whatever... apparently to nobody's scandal, lucky them. Pushing the burden of dealing with this to users or to post-processing tools, like you suggest in your blog, is IMO hardly a satisfactory solution: it is rather a non-solution and an attempt to make your problem everyone's problem.

Now, if you then move the goalpost and claim the problem is the _degree_ of linker involvement, as you seem to suggest in your blog, then again your feedback is very welcome to make SFrame more linker-friendly, as long as it is in the form of concrete constructive suggestions _and_ not at the cost of the user's requirements for the format.
Fourth, some people think that it is unreasonable to expect all the ELF linkers in existence to be aware of SFrame sections (not me; you can count all those linkers on your fingers, no toes needed). ELF already supports a standard section flag SHF_OS_NONCONFORMING that tries to deal with cases like this... and fails miserably: unknown sections marked with that flag are not required to be amenable to concatenation, but the problem is that upon encountering them the linker is expected to abort the link with an error. This is hardly convenient for anyone, so the SFrame people are currently proposing to the gABI [2] the addition of a new flag SHF_OS_NONCONFORMING_DISCARD that would make the linker just discard the unknown input section rather than aborting the link.

Fifth, the problems related to GC and section grouping were discussed during Cauldron [3] and I believe a solution has already been found, proposed independently by Roland in the gabi discussion thread. I think that solution is being written down and will have to be reviewed before being used in SFrame V3. Your help on that review would also be very much appreciated, considering your vast experience on these matters. I suppose it will happen in the binutils list. I am sure they will CC you in the relevant thread so you won't miss it.

Sixth, several people have repeatedly pointed out that it is not reasonable to extrapolate the big gap between SFrame V2 and V3 to future revisions of the format, and that not implementing V2 in lld is perfectly ok, because the kernel will start directly with V3. The SFrame maintainer has assured, also repeatedly, that she is well aware that any change in the spec will have to be very carefully considered to avoid or minimize any impact on the linkers.
I would say the fact she also maintains the SFrame support in ld may also serve as bail to guarantee her good behavior, at least to some extent ;)

Finally, it is not clear to me at all why supporting SFrame would result in such an unbearable burden to lld's maintenance, as you seem to expect, given everything is fine on the GNU side. Please don't misunderstand me: you know your linker better than anyone else and I am sure there are good reasons that justify such apprehension. What is it? Is it lack of contributors? Is the lld codebase particularly difficult to maintain or extend? Looking at [4], the people that are contributing the SFrame support to LLVM are also volunteering to maintain it moving forward... isn't that enough? Perhaps you need a co-maintainer? Can we, or anyone else, help somehow? If so, how?

Salud!

[1] https://sourceware.org/pipermail/binutils/2025-October/145086.html
[2] https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8
[3] https://www.youtube.com/watch?v=L2UmAp39xqk
[4] https://discourse.llvm.org/t/rfc-adding-sframe-support-to-llvm/86900

^ permalink raw reply [flat|nested] 30+ messages in thread
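The flag behaviour described in the "Fourth" point above can be sketched as a small decision function. This is a hedged illustration, not code from ld or lld: `SHF_OS_NONCONFORMING` and its value 0x100 are the existing gABI definition, but the discard flag is still only a proposal, so the bit value 0x1000 used for it below is a made-up placeholder, and `enum input_action` and `classify_input_section` are hypothetical names. "Known" is also simplified to a single boolean, where a real linker would look at sh_type and the OS-specific flag ranges.

```c
#include <stdbool.h>
#include <stdint.h>

/* Existing gABI flag: an unrecognized OS-specific section carrying it
 * makes the link editor reject the object file. */
#define SHF_OS_NONCONFORMING 0x100u

/* ASSUMPTION: placeholder bit for the proposed flag. No value has been
 * assigned by the gABI; 0x1000 is chosen only for this sketch. */
#define HYP_SHF_OS_NONCONFORMING_DISCARD 0x1000u

enum input_action { INPUT_KEEP, INPUT_DISCARD, INPUT_REJECT };

/* Decide what a linker does with one input section.
 * type_is_known: the linker has dedicated support for this section type
 * (e.g. it knows how to merge SFrame data). */
enum input_action classify_input_section(bool type_is_known, uint64_t sh_flags)
{
    if (type_is_known)
        return INPUT_KEEP;    /* merge it with the format-specific logic */
    if (sh_flags & HYP_SHF_OS_NONCONFORMING_DISCARD)
        return INPUT_DISCARD; /* proposed: drop quietly, link continues */
    if (sh_flags & SHF_OS_NONCONFORMING)
        return INPUT_REJECT;  /* today: abort the link with an error */
    return INPUT_KEEP;        /* ordinary unknown section: concatenate */
}
```

The point of the generic check is that a linker without SFrame support needs no SFrame-specific branch at all: the decision is driven purely by the section flags.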
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
` (2 preceding siblings ...)
2025-10-30 14:47 ` Jose E. Marchesi
@ 2025-11-04 9:21 ` Indu
2025-11-05 8:21 ` Fangrui Song
2025-12-01 9:04 ` Fangrui Song
4 siblings, 1 reply; 30+ messages in thread
From: Indu @ 2025-11-04 9:21 UTC (permalink / raw)
To: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On 2025-10-29 11:53 p.m., Fangrui Song wrote:
> I've been following the SFrame discussion and wanted to share some
> concerns about its viability for userspace adoption, based on concrete
> measurements and comparison with existing compact unwind implementations
> in LLVM.
>
> **Size overhead concerns**
>
> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
> approximately 10% larger than the combined size of .eh_frame
> and .eh_frame_hdr (8.06 MiB total).
> This is problematic because .eh_frame cannot be eliminated - it contains
> essential information for restoring callee-saved registers, LSDA, and
> personality information needed for debugging (e.g. reading local
> variables in a coredump) and C++ exception handling.
>
> This means adopting SFrame would result in carrying both formats, with a
> large net size increase.
>
> **Learning from existing compact unwind implementations**
>
> It's worth noting that LLVM has had a battle-tested compact unwind
> format in production use since 2009 with OS X 10.6, which transitioned
> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
>
> __text section: 0x4a55470 bytes
> __unwind_info section: 0x79060 bytes (0.6% of __text)
> __eh_frame section: 0x58 bytes
>

I believe this is only synchronous? If yes, do you think this is a fair measurement to compare against?

Does the compact unwind info scheme work well for cases of shrink-wrapping?
How about the case of AArch64, where the ABI does not mandate if and where frame record is created ? For the numbers above, does it ensure precise stack traces ? From the The Apple Compact Unwinding Format document (https://faultlore.com/blah/compact-unwinding/), "One consequence of only having one opcode for a whole function is that functions will generally have incorrect instructions for the function’s prologue (where callee-saved registers are individually PUSHed onto the stack before the rest of the stack space is allocated)." "Presumably this isn’t a very big deal, since there’s very few situations where unwinding would involve a function still executing its prologue/epilogue." Well, getting precise stack traces is a big deal and the users want them. > (On macOS you can check the section size with objdump --arch x86_64 - > h clang and dump the unwind info with objdump --arch x86_64 --unwind- > info clang) > > OpenVMS's x86-64 port, which is ELF-based, also adopted this format as > documented in their "VSI OpenVMS Calling Standard" and their 2018 post: > https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282 > > The compact unwind format achieves this efficiency through a two-level > page table structure. It describes common frame layouts compactly and > falls back to DWARF only when necessary, allowing most DWARF CFI entries > to be eliminated while maintaining full functionality. For more details, > see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO > implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/ > UnwindInfoSection.cpp > How does your vision of "linker-friendly" stack tracing/stack unwinding format reconcile with these suggested approaches ? As far as I can tell, these formats also require linker created indexes and are non-concatenable (custom handling in every linker). Something you've had "significant concerns" about. 
From https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64: "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch Table'') is created by the linker using information in the unwind descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section B.3.3, ''Compact Unwind Description'') provided by compilers. The linker may use the provided unwind descriptors directly or replace them with equivalent optimized forms based on its optimization strategies." Above all, do users want a solution which requires falling back on DWARF-based processing for precise stack tracing ? > **The AArch64 case: size matters even more** > > The size consideration becomes even more critical for AArch64, which is > heavily deployed on mobile phones. > There's an active feature request for compact unwind support in the > AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344 > This underscores the broader industry need for efficient unwind > information that doesn't duplicate data or significantly increase binary > size. > Our measurements with a dataset of about 1400 userspace artifacts (binaries and shared libraries) show that the SFrame/(EH Frame + EH Frame HDR) ratio is: - Average of 0.70 on AArch64. - Average of 1.00 on x86_64. Projecting the size of what you observe for clang binary on x86_64 to conclude the size ratio on AArch64 is not very wise to do. Whether the size impact is worth the benefit: its a choice for users to make. SFrame offers the users fast, precise stack traces with simple stack tracers. > There are at least two formats the ELF one can learn from: LLVM's > compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code. > Please, if you have any concrete suggestions (keeping the above goals in mind), you already know how/where to engage. 
> **Path forward**
>
> Unless SFrame can actually replace .eh_frame (rather than supplementing
> it as an accelerator for linux-perf) and demonstrate sizes smaller
> than .eh_frame - matching the efficiency of existing compact unwind
> approaches - I question its practical viability for userspace.
> The current design appears to add overhead rather than reduce it.
> This isn't to suggest we should simply adopt the existing compact unwind
> format wholesale.
> The x86-64 design dates back to 2009 or earlier, and there are likely
> improvements we can make. However, we should aim for similar or better
> efficiency gains.
>
> For additional context, I've documented my detailed analysis at:
>
> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> mandatory index building problems, section group compliance and garbage
> collection issues, and version compatibility challenges)

GC issue is a bug currently tracked and with a target milestone of 2.46.

> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> (size analysis)
>
* Re: Concerns about SFrame viability for userspace stack walking 2025-11-04 9:21 ` Indu @ 2025-11-05 8:21 ` Fangrui Song 2025-11-06 0:44 ` Indu Bhagat 0 siblings, 1 reply; 30+ messages in thread From: Fangrui Song @ 2025-11-05 8:21 UTC (permalink / raw) To: Indu; +Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel > On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote: > On 2025-10-29 11:53 p.m., Fangrui Song wrote: > > I've been following the SFrame discussion and wanted to share some > > concerns about its viability for userspace adoption, based on concrete > > measurements and comparison with existing compact unwind implementations > > in LLVM. > > > > **Size overhead concerns** > > > > Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is > > approximately 10% larger than the combined size of .eh_frame > > and .eh_frame_hdr (8.06 MiB total). > > This is problematic because .eh_frame cannot be eliminated - it contains > > essential information for restoring callee-saved registers, LSDA, and > > personality information needed for debugging (e.g. reading local > > variables in a coredump) and C++ exception handling. > > > > This means adopting SFrame would result in carrying both formats, with a > > large net size increase. > > > > **Learning from existing compact unwind implementations** > > > > It's worth noting that LLVM has had a battle-tested compact unwind > > format in production use since 2009 with OS X 10.6, which transitioned > > to using CFI directives in 2013 [1]. The efficiency gains are dramatic: > > > > __text section: 0x4a55470 bytes > > __unwind_info section: 0x79060 bytes (0.6% of __text) > > __eh_frame section: 0x58 bytes > > > > I believe this is only synchronous? If yes, do you think this is a fair > measurement to compare against ? > > Does the compact unwind info scheme work well for cases of > shrink-wrapping ? 
How about the case of AArch64, where the ABI does not > mandate if and where frame record is created ? > > For the numbers above, does it ensure precise stack traces ? > > From the The Apple Compact Unwinding Format document > (https://faultlore.com/blah/compact-unwinding/), > "One consequence of only having one opcode for a whole function is that > functions will generally have incorrect instructions for the function’s > prologue (where callee-saved registers are individually PUSHed onto the > stack before the rest of the stack space is allocated)." > > "Presumably this isn’t a very big deal, since there’s very few > situations where unwinding would involve a function still executing its > prologue/epilogue." > > Well, getting precise stack traces is a big deal and the users want them. **Shrink-wrapping and precise stack traces**: Yes, compact unwind handles these through an extension proposed by OpenVMS (not yet upstreamed to LLVM): https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html Technical details of the extension: - A single unwind group describes a (prologue_part1, prologue_part2, body, epilogue) tuple. - The prologue is conceptually split into two parts: the first part extends up to and including the instruction that decreases RSP; the second part extends to a point after the last preserved register is saved but before any preserved register is modified (this location is not unique, providing flexibility). + When unwinding in the prologue, the RSP register value can be inferred from the PC and the set of saved registers. - Since register restoration is idempotent (restoring preserved registers multiple times during unwinding causes no harm), there is no need to describe `pop $reg` sequences. The unwind group needs just one bit to describe whether the 1-byte `ret` instruction is present. - The `length` field in the compact unwind group descriptor is repurposed to describe the prologue's two parts. 
- By composing multiple unwind groups, potentially with zero-sized prologues or omitting `ret` instructions in epilogues, it can describe functions with shrink wrapping or tail duplication optimization.
- Null frame groups (with no prologue or epilogue) are the default and can describe trampolines and PLT stubs.

Now, to compare this against SFrame's space efficiency for synchronous unwinding, I've built llvm-mc, opt, and clang with -fno-asynchronous-unwind-tables -funwind-tables across multiple build configurations (clang vs gcc, frame pointer vs sframe). The resulting .sframe section sizes are significant:

% cat ~/tmp/test-unwind.sh
#!/bin/zsh
conf() {
  configure-llvm $@ -DCMAKE_EXE_LINKER_FLAGS='-pie -Wl,-z,pack-relative-relocs' -DLLVM_ENABLE_UNWIND_TABLES=on \
    -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off
}

clang=-fno-integrated-as
gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")

fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no -fno-asynchronous-unwind-tables -funwind-tables"
sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe -fno-asynchronous-unwind-tables -funwind-tables"

conf custom-fp-sync -DCMAKE_{C,CXX}_FLAGS="$clang $fp"
conf custom-sframe-sync -DCMAKE_{C,CXX}_FLAGS="$clang $sframe"
conf custom-fp-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
conf custom-sframe-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}

for i in fp-sync sframe-sync fp-gcc-sync sframe-gcc-sync; do ninja -C /tmp/out/custom-$i llvm-mc opt clang; done

% ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe}-{,gcc-}sync/bin/{llvm-mc,opt,clang}
Filename                                    | .text size       | EH size        | .sframe size   | VM size   | VM increase
--------------------------------------------+------------------+----------------+----------------+-----------+------------
/tmp/out/custom-fp-sync/bin/llvm-mc         | 2124031 (23.5%)  | 301136 (3.3%)  | 0 (0.0%)       | 9050149   | -
/tmp/out/custom-sframe-sync/bin/llvm-mc     | 2114383 (22.3%)  | 367452 (3.9%)  | 348235 (3.7%)  | 9483621   | +4.8%
/tmp/out/custom-fp-gcc-sync/bin/llvm-mc     | 2744214 (29.2%)  | 301836 (3.2%)  | 0 (0.0%)       | 9389677   | +3.8%
/tmp/out/custom-sframe-gcc-sync/bin/llvm-mc | 2705860 (27.7%)  | 354292 (3.6%)  | 356073 (3.6%)  | 9780985   | +8.1%
/tmp/out/custom-fp-sync/bin/opt             | 38873081 (69.9%) | 3538408 (6.4%) | 0 (0.0%)       | 55598521  | -
/tmp/out/custom-sframe-sync/bin/opt         | 39011423 (62.4%) | 4557012 (7.3%) | 4452908 (7.1%) | 62494765  | +12.4%
/tmp/out/custom-fp-gcc-sync/bin/opt         | 54654535 (78.1%) | 3631076 (5.2%) | 0 (0.0%)       | 70001573  | +25.9%
/tmp/out/custom-sframe-gcc-sync/bin/opt     | 53644831 (70.4%) | 4857220 (6.4%) | 5263530 (6.9%) | 76205733  | +37.1%
/tmp/out/custom-fp-sync/bin/clang           | 68345753 (73.8%) | 6643384 (7.2%) | 0 (0.0%)       | 92638305  | -
/tmp/out/custom-sframe-sync/bin/clang       | 68500319 (64.9%) | 8684540 (8.2%) | 8521760 (8.1%) | 105572021 | +14.0%
/tmp/out/custom-fp-gcc-sync/bin/clang       | 96515079 (82.8%) | 6556756 (5.6%) | 0 (0.0%)       | 116524565 | +25.8%
/tmp/out/custom-sframe-gcc-sync/bin/clang   | 94583903 (74.0%) | 8817628 (6.9%) | 9696263 (7.6%) | 127839309 | +38.0%

Note: in GCC FP builds, .text is larger due to missing optimization for RBP-based frames (e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108386). Once this optimization is implemented, GCC FP builds should actually have smaller .text than RSP-based builds, because RBP-relative addressing produces more compact encodings than RSP-relative addressing (which requires an extra SIB byte).

.sframe for sync is not noticeably smaller than that for async. This is probably because there are still many DW_CFA_advance_loc ops even in -fno-asynchronous-unwind-tables -funwind-tables builds.
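For readers who want to cross-check the "VM increase" column of the table above: the figures appear to be relative to the clang frame-pointer build (custom-fp-sync) of the same binary. The following is a minimal Python sketch under that assumption, using only the VM sizes copied from the table. The names `VM_SIZES`, `BASELINE` and `vm_increase` are illustrative and are not part of section_size.rb.

```python
# VM sizes in bytes, copied from the table above: binary -> configuration -> size.
VM_SIZES = {
    "llvm-mc": {"fp-sync": 9050149, "sframe-sync": 9483621,
                "fp-gcc-sync": 9389677, "sframe-gcc-sync": 9780985},
    "opt": {"fp-sync": 55598521, "sframe-sync": 62494765,
            "fp-gcc-sync": 70001573, "sframe-gcc-sync": 76205733},
    "clang": {"fp-sync": 92638305, "sframe-sync": 105572021,
              "fp-gcc-sync": 116524565, "sframe-gcc-sync": 127839309},
}

# Assumption: every row is compared against the clang frame-pointer build.
BASELINE = "fp-sync"


def vm_increase(binary: str, config: str) -> str:
    """Return the VM size growth of `config` over the baseline, as a percentage string."""
    base = VM_SIZES[binary][BASELINE]
    size = VM_SIZES[binary][config]
    return f"{(size - base) / base * 100:+.1f}%"


if __name__ == "__main__":
    for binary, sizes in VM_SIZES.items():
        for config in sizes:
            if config != BASELINE:
                print(f"{binary:8} {config:16} {vm_increase(binary, config)}")
```

Under this reading, the GCC rows mix two effects: the compiler change (larger .text) and the unwind-format change. Comparing sframe-gcc-sync against fp-gcc-sync instead would isolate the latter.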
```
% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe-gcc-sync/bin/clang
    FILE SIZE        VM SIZE
 --------------  --------------
  64.0%  90.2Mi  74.0%  90.2Mi    .text
  10.9%  15.4Mi   0.0%       0    .strtab
   7.0%  9.92Mi   8.1%  9.92Mi    .rodata
   6.6%  9.25Mi   7.6%  9.25Mi    .sframe
   5.2%  7.38Mi   6.1%  7.38Mi    .eh_frame
   2.9%  4.14Mi   0.0%       0    .symtab
   1.4%  1.94Mi   1.6%  1.94Mi    .data.rel.ro
   0.9%  1.23Mi   1.0%  1.23Mi    [LOAD #4 [R]]
   0.7%  1.03Mi   0.8%  1.03Mi    .eh_frame_hdr
   0.0%       0   0.5%   636Ki    .bss
   0.2%   298Ki   0.2%   298Ki    .data
   0.0%  23.1Ki   0.0%  23.1Ki    .rela.dyn
   0.0%  10.5Ki   0.0%       0    [Unmapped]
   0.0%  9.04Ki   0.0%  9.04Ki    .dynstr
   0.0%  8.79Ki   0.0%  8.79Ki    .dynsym
   0.0%  7.31Ki   0.0%  7.31Ki    .rela.plt
   0.0%  6.42Ki   0.0%  3.98Ki    [20 Others]
   0.0%  4.89Ki   0.0%  4.89Ki    .plt
   0.0%  3.55Ki   0.0%  3.50Ki    .init_array
   0.0%  2.50Ki   0.0%  2.50Ki    .hash
   0.0%  2.46Ki   0.0%  2.46Ki    .got.plt
 100.0%   140Mi 100.0%   121Mi    TOTAL
```

Here is an aarch64 build:

cmake -GNinja -Sllvm -B/tmp/out/a64-sframe -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ -DLLVM_HOST_TRIPLE=aarch64-linux-gnu -DLLVM_TARGETS_TO_BUILD=AArch64 -DLLVM_ENABLE_PLUGINS=off -DCMAKE_EXE_LINKER_FLAGS='-no-pie -B$HOME/opt/binutils-aarch64/bin' -DCMAKE_{C,CXX}_FLAGS="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils-aarch64/bin -Wa,--gsframe" -DLLVM_NATIVE_TOOL_DIR=/tmp/out/custom-fp-gcc-sync/bin -DLLVM_ENABLE_PROJECTS=clang

% ~/Dev/bloaty/out/release/bloaty /tmp/out/a64-sframe/bin/clang
    FILE SIZE        VM SIZE
 --------------  --------------
  60.0%  71.8Mi  73.2%  71.8Mi    .text
  12.3%  14.8Mi   0.0%       0    .strtab
   8.0%  9.53Mi   9.7%  9.53Mi    .rodata
   6.2%  7.39Mi   0.0%       0    .symtab
   5.8%  6.93Mi   7.1%  6.93Mi    .eh_frame
   4.2%  5.01Mi   5.1%  5.01Mi    .sframe
   1.7%  2.00Mi   2.0%  2.00Mi    .data.rel.ro
   0.8%  1.01Mi   1.0%  1.01Mi    [LOAD #2 [RX]]
   0.8%   932Ki   0.9%   932Ki    .eh_frame_hdr
   0.0%       0   0.6%   599Ki    .bss
   0.2%   294Ki   0.3%   294Ki    .data
   0.0%  40.2Ki   0.0%  40.2Ki    .got
   0.0%  20.6Ki   0.0%       0    [Unmapped]
   0.0%  9.19Ki   0.0%  9.19Ki    .dynstr
   0.0%  8.51Ki   0.0%  8.51Ki    .dynsym
   0.0%  7.41Ki   0.0%  7.41Ki    .rela.plt
   0.0%  4.97Ki   0.0%  4.97Ki    .plt
   0.0%  4.37Ki   0.0%  4.07Ki    [17 Others]
   0.0%  3.35Ki   0.0%  3.30Ki    .init_array
   0.0%  2.49Ki   0.0%  2.49Ki    .got.plt
   0.0%  2.06Ki   0.0%       0    [ELF Section Headers]

> > (On macOS you can check the section size with
> > objdump --arch x86_64 -h clang and dump the unwind info with
> > objdump --arch x86_64 --unwind-info clang)
> >
> > OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
> > documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
> > https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> >
> > The compact unwind format achieves this efficiency through a two-level
> > page table structure. It describes common frame layouts compactly and
> > falls back to DWARF only when necessary, allowing most DWARF CFI entries
> > to be eliminated while maintaining full functionality. For more details,
> > see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
> > implemention
> > https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
> >
>
> How does your vision of "linker-friendly" stack tracing/stack unwinding
> format reconcile with these suggested approaches ? As far as I can tell,
> these formats also require linker created indexes and are
> non-concatenable (custom handling in every linker). Something you've
> had "significant concerns" about.
>

We can distinguish between linking-time and execution-time representations by using different section names.
The OpenVMS specification says:

It is useful to note that the run-time representation of unwind information can vary from little more than a simple concatenation of the compile-time information to a substantial rewriting of unwind information by the linker. The proposal favors simple concatenation while maintaining the same ordering of groups as their associated code.

The runtime library can build this index at runtime and cache it to disk.
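To make the idea of "concatenate at link time, index at run time" concrete, here is a minimal Python sketch. It is an illustration only and does not reflect the OpenVMS or SFrame encodings: the `UnwindGroup` record, its fields, and the `build_index` and `lookup` helpers are all hypothetical, and the on-disk caching step is omitted.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class UnwindGroup:
    """Hypothetical unwind record covering the range [start, start + length)."""
    start: int     # address of the first code byte covered
    length: int    # number of code bytes covered
    encoding: int  # opaque frame-layout descriptor


def build_index(groups):
    """Sort the concatenated records by start address.

    Returns (sorted start addresses, sorted records) so that a lookup
    can binary-search the start addresses.
    """
    ordered = sorted(groups, key=lambda g: g.start)
    return [g.start for g in ordered], ordered


def lookup(index, pc):
    """Return the record covering `pc`, or None if no record covers it."""
    starts, ordered = index
    i = bisect_right(starts, pc) - 1  # last record starting at or before pc
    if i < 0:
        return None
    group = ordered[i]
    return group if pc < group.start + group.length else None


if __name__ == "__main__":
    # Records as they might look after simple concatenation of two inputs.
    concatenated = [
        UnwindGroup(0x2000, 0x40, 0x01),
        UnwindGroup(0x1000, 0x80, 0x02),
        UnwindGroup(0x1080, 0x20, 0x03),
    ]
    index = build_index(concatenated)
    print(lookup(index, 0x1090))
    print(lookup(index, 0x3000))
```

If the linker keeps groups in the same order as their associated code, as the quoted text suggests, the sort becomes a no-op and the index reduces to an array of start addresses; the sketch sorts anyway so it does not depend on that property.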
Once the design becomes sufficiently stable, we can introduce an opt-in linker option --xxxxframe-index that builds an index from recognized format versions while reporting warnings for unrecognized ones. We need to carefully design this mechanism to be stable and robust, avoiding frequent format updates. > From > https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64: > "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch > Table'') is created by the linker using information in the unwind > descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section > B.3.3, ''Compact Unwind Description'') provided by compilers. The linker > may use the provided unwind descriptors directly or replace them with > equivalent optimized forms based on its optimization strategies." > > Above all, do users want a solution which requires falling back on > DWARF-based processing for precise stack tracing ? The key distinction is that compact unwind handles the vast majority of functions without DWARF—the macOS measurements show __unwind_info at 0.6% of __text size with __eh_frame reduced to negligible size (0x58 bytes). While SFrame also cannot handle all frames, compact unwind achieves dramatic size reductions by making DWARF the exception rather than requiring it alongside a supplementary format. The DWARF fallback provides flexibility for additional coverage when needed, but nothing is lost (at least for the clang binary on macOS) if DWARF fallback were disabled in a hypothetical future linux-perf implementation. > > **The AArch64 case: size matters even more** > > > > The size consideration becomes even more critical for AArch64, which is > > heavily deployed on mobile phones. 
> > There's an active feature request for compact unwind support in the > > AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344 > > This underscores the broader industry need for efficient unwind > > information that doesn't duplicate data or significantly increase binary > > size. > > > > Our measurements with a dataset of about 1400 userspace artifacts > (binaries and shared libraries) show that the SFrame/(EH Frame + EH > Frame HDR) ratio is: > - Average of 0.70 on AArch64. > - Average of 1.00 on x86_64. > > Projecting the size of what you observe for clang binary on x86_64 to > conclude the size ratio on AArch64 is not very wise to do. > > Whether the size impact is worth the benefit: its a choice for users to > make. SFrame offers the users fast, precise stack traces with simple > stack tracers. Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on AArch64, this represents substantial memory overhead when considering: .eh_frame is already large and being complained about. Being unable to eliminate it (needed for debugging and C++ exceptions) and adding 0.70x more means significant additional overhead for users. > > There are at least two formats the ELF one can learn from: LLVM's > > compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code. > > > > Please, if you have any concrete suggestions (keeping the above goals in > mind), you already know how/where to engage. I've provided concrete suggestions throughout this discussion. > > **Path forward** > > > > Unless SFrame can actually replace .eh_frame (rather than supplementing > > it as an accelerator for linux-perf) and demonstrate sizes smaller > > than .eh_frame - matching the efficiency of existing compact unwind > > approaches — I question its practical viability for userspace. > > The current design appears to add overhead rather than reduce it. > > This isn't to suggest we should simply adopt the existing compact unwind > > format wholesale. 
> > The x86-64 design dates back to 2009 or earlier, and there are likely
> > improvements we can make. However, we should aim for similar or better
> > efficiency gains.
> >
> > For additional context, I've documented my detailed analysis at:
> >
> > - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> > mandatory index building problems, section group compliance and garbage
> > collection issues, and version compatibility challenges)
>
> GC issue is a bug currently tracked and with a target milestone of 2.46.
>
> > - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> > (size analysis)
> >
* Re: Concerns about SFrame viability for userspace stack walking 2025-11-05 8:21 ` Fangrui Song @ 2025-11-06 0:44 ` Indu Bhagat 2025-11-06 7:51 ` Florian Weimer 2025-11-06 9:20 ` Fangrui Song 0 siblings, 2 replies; 30+ messages in thread From: Indu Bhagat @ 2025-11-06 0:44 UTC (permalink / raw) To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel On 11/5/25 12:21 AM, Fangrui Song wrote: >> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote: >> On 2025-10-29 11:53 p.m., Fangrui Song wrote: >>> I've been following the SFrame discussion and wanted to share some >>> concerns about its viability for userspace adoption, based on concrete >>> measurements and comparison with existing compact unwind implementations >>> in LLVM. >>> >>> **Size overhead concerns** >>> >>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is >>> approximately 10% larger than the combined size of .eh_frame >>> and .eh_frame_hdr (8.06 MiB total). >>> This is problematic because .eh_frame cannot be eliminated - it contains >>> essential information for restoring callee-saved registers, LSDA, and >>> personality information needed for debugging (e.g. reading local >>> variables in a coredump) and C++ exception handling. >>> >>> This means adopting SFrame would result in carrying both formats, with a >>> large net size increase. >>> >>> **Learning from existing compact unwind implementations** >>> >>> It's worth noting that LLVM has had a battle-tested compact unwind >>> format in production use since 2009 with OS X 10.6, which transitioned >>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic: >>> >>> __text section: 0x4a55470 bytes >>> __unwind_info section: 0x79060 bytes (0.6% of __text) >>> __eh_frame section: 0x58 bytes >>> >> >> I believe this is only synchronous? If yes, do you think this is a fair >> measurement to compare against ? >> >> Does the compact unwind info scheme work well for cases of >> shrink-wrapping ? 
How about the case of AArch64, where the ABI does not
>> mandate if and where frame record is created ?
>>
>> For the numbers above, does it ensure precise stack traces ?
>>
>> From the The Apple Compact Unwinding Format document
>> (https://faultlore.com/blah/compact-unwinding/),
>> "One consequence of only having one opcode for a whole function is that
>> functions will generally have incorrect instructions for the function’s
>> prologue (where callee-saved registers are individually PUSHed onto the
>> stack before the rest of the stack space is allocated)."
>>
>> "Presumably this isn’t a very big deal, since there’s very few
>> situations where unwinding would involve a function still executing its
>> prologue/epilogue."
>>
>> Well, getting precise stack traces is a big deal and the users want them.
>
> **Shrink-wrapping and precise stack traces**: Yes, compact unwind
> handles these through an extension proposed by OpenVMS (not yet
> upstreamed to LLVM):
> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html
>

Thanks for the link. The above questions were strictly in the context of the battle-tested "The Apple Compact Unwinding Format" in production in the lld/MachO implementation, not the proposed OpenVMS extensions. Is it possible to get answers to those questions with that context in place?

If shrink-wrapping and precise stack traces aren't supported without the OpenVMS extension (which is not yet implemented), aren't we comparing apples with pears here?

> Technical details of the extension:
>
> - A single unwind group describes a (prologue_part1, prologue_part2,
> body, epilogue) tuple.
> - The prologue is conceptually split into two parts: the first part
> extends up to and including the instruction that decreases RSP; the
> second part extends to a point after the last preserved register is
> saved but before any preserved register is modified (this location is
> not unique, providing flexibility).
> + When unwinding in the prologue, the RSP register value can be
> inferred from the PC and the set of saved registers.
> - Since register restoration is idempotent (restoring preserved
> registers multiple times during unwinding causes no harm), there is no
> need to describe `pop $reg` sequences. The unwind group needs just one
> bit to describe whether the 1-byte `ret` instruction is present.

Is this true for asynchronous stack tracing too?

> - The `length` field in the compact unwind group descriptor is
> repurposed to describe the prologue's two parts.
> - By composing multiple unwind groups, potentially with zero-sized
> prologues or omitting `ret` instructions in epilogues, it can describe
> functions with shrink wrapping or tail duplication optimization.
> - Null frame groups (with no prologue or epilogue) are the default and
> can describe trampolines and PLT stubs.

PLT stubs may use the stack (push to stack). As per the document, "A null frame (MODE = 8) is the simplest possible frame, with no allocated stack of either kind (hence no saved registers)". So a null frame can be used for a PLT stub only if the functions invoking the PLT stub use an RBP-based frame, isn't that so? BTW, both EH Frame and SFrame have a specific, condensed representation for the metadata of PLT entries.

>

Anyway, thanks for the summary. I see that the OpenVMS extension for asynchronous compact unwind descriptors is in an RFC state ATM. A few important observations and questions:

- As noted in the recently revived discussion, https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471, there is going to be a *non-negligible* size overhead as soon as you move towards a specification for asynchronous unwinding (vs the current specification, which caters to synchronous unwinding only). Now add to that the quirks of each architecture/ABI :). Any comments?

- From the document: "Use of any preserved register must be delayed until all of the preserved registers have been saved."
  Q: Does this work well with optimizing compilers? Is this an ABI change being asked of multiple architectures?

- From the document: "It appears technically feasible for a null frame function to have a personality routine. However, the utility of such a capability seems too meager to justify allowing this. We propose to not support this." and "If the first attempt to lookup an unwind group for an exception address fails, then it is (tentatively) assumed to have occurred within a null frame function or in a part of a function that is adequately described by a null frame. The presumed return address is (virtually or actually) popped from the top of stack and looked up. This second attempted lookup must succeed, in which case processing continues normally. A failure is a fatal error."
  Q: Is this a problem, especially because the goal is to evolve the OpenVMS RFC proposal to subsume .eh_frame?

Are there people actively working towards bringing this to fruition?

> Now, to compare this against SFrame's space efficiency for synchronous
> unwinding, I've built llvm-mc, opt, and clang with
> -fno-asynchronous-unwind-tables -funwind-tables across multiple build
> configurations (clang vs gcc, frame pointer vs sframe).
The resulting > .sframe section sizes are significant: > > % cat ~/tmp/test-unwind.sh > #!/bin/zsh > conf() { > configure-llvm $@ -DCMAKE_EXE_LINKER_FLAGS='-pie > -Wl,-z,pack-relative-relocs' -DLLVM_ENABLE_UNWIND_TABLES=on \ > -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off > } > > clang=-fno-integrated-as > gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" > "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++") > > fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer > -B$HOME/opt/binutils/bin -Wa,--gsframe=no > -fno-asynchronous-unwind-tables -funwind-tables" > sframe="-fomit-frame-pointer -momit-leaf-frame-pointer > -B$HOME/opt/binutils/bin -Wa,--gsframe -fno-asynchronous-unwind-tables > -funwind-tables" > > conf custom-fp-sync -DCMAKE_{C,CXX}_FLAGS="$clang $fp" > conf custom-sframe-sync -DCMAKE_{C,CXX}_FLAGS="$clang $sframe" > conf custom-fp-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]} > conf custom-sframe-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]} > > for i in fp-sync sframe-sync fp-gcc-sync sframe-gcc-sync; do ninja -C > /tmp/out/custom-$i llvm-mc opt clang; done > > % ~/Dev/unwind-info-size-analyzer/section_size.rb > /tmp/out/custom-{fp,sframe}-{,gcc-}sync/bin/{llvm-mc,opt,clang} > Filename | .text size | > EH size | .sframe size | VM size | VM increase > --------------------------------------------+------------------+----------------+----------------+-----------+------------ > /tmp/out/custom-fp-sync/bin/llvm-mc | 2124031 (23.5%) | > 301136 (3.3%) | 0 (0.0%) | 9050149 | - > /tmp/out/custom-sframe-sync/bin/llvm-mc | 2114383 (22.3%) | > 367452 (3.9%) | 348235 (3.7%) | 9483621 | +4.8% > /tmp/out/custom-fp-gcc-sync/bin/llvm-mc | 2744214 (29.2%) | > 301836 (3.2%) | 0 (0.0%) | 9389677 | +3.8% > /tmp/out/custom-sframe-gcc-sync/bin/llvm-mc | 2705860 (27.7%) | > 354292 (3.6%) | 356073 (3.6%) | 9780985 | +8.1% > /tmp/out/custom-fp-sync/bin/opt | 38873081 (69.9%) | > 3538408 (6.4%) | 0 (0.0%) | 55598521 | - > 
/tmp/out/custom-sframe-sync/bin/opt | 39011423 (62.4%) | > 4557012 (7.3%) | 4452908 (7.1%) | 62494765 | +12.4% > /tmp/out/custom-fp-gcc-sync/bin/opt | 54654535 (78.1%) | > 3631076 (5.2%) | 0 (0.0%) | 70001573 | +25.9% > /tmp/out/custom-sframe-gcc-sync/bin/opt | 53644831 (70.4%) | > 4857220 (6.4%) | 5263530 (6.9%) | 76205733 | +37.1% > /tmp/out/custom-fp-sync/bin/clang | 68345753 (73.8%) | > 6643384 (7.2%) | 0 (0.0%) | 92638305 | - > /tmp/out/custom-sframe-sync/bin/clang | 68500319 (64.9%) | > 8684540 (8.2%) | 8521760 (8.1%) | 105572021 | +14.0% > /tmp/out/custom-fp-gcc-sync/bin/clang | 96515079 (82.8%) | > 6556756 (5.6%) | 0 (0.0%) | 116524565 | +25.8% > /tmp/out/custom-sframe-gcc-sync/bin/clang | 94583903 (74.0%) | > 8817628 (6.9%) | 9696263 (7.6%) | 127839309 | +38.0% > > Note: in GCC FP builds, .text is larger due to missing optimization > for RBP-based frames (e.g. > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108386). Once this > optimization is implemented, GCC FP builds should actually have > smaller .text than RSP-based builds, because RBP-relative addressing > produces more compact encodings than RSP-relative addressing (which > requires an extra SIB byte). > > .sframe for sync is not noticeably smaller than that for async. This > is probably because > there are still many DW_CFA_advance_loc ops even in > -fno-asynchronous-unwind-tables -funwind-tables builds. > Possible that its because in the Apple Compact Unwind Format, the linker optimizes compact unwind descriptors into the three-level paged structure, effectively de-duplicating some content. 
>>> (On macOS you can check the section size with objdump --arch x86_64 - >>> h clang and dump the unwind info with objdump --arch x86_64 --unwind- >>> info clang) >>> >>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as >>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post: >>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282 >>> >>> The compact unwind format achieves this efficiency through a two-level >>> page table structure. It describes common frame layouts compactly and >>> falls back to DWARF only when necessary, allowing most DWARF CFI entries >>> to be eliminated while maintaining full functionality. For more details, >>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO >>> implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/ >>> UnwindInfoSection.cpp >>> >> >> How does your vision of "linker-friendly" stack tracing/stack unwinding >> format reconcile with these suggested approaches ? As far as I can tell, >> these formats also require linker created indexes and are >> non-concatenable (custom handling in every linker). Something you've >> had "significant concerns" about. >> This question is unanswered: What do you think about "linker-friendliness" of the current implementation of the lld/MachO implementation of the compact unwind format in LLVM ? > > We can distinguish between linking-time and execution-time > representations by using different section names. > The OpenVMS specification says: > > It is useful to note that the run-time representation of unwind > information can vary from little more than a simple concatenation of > the compile-time information to a substantial rewriting of unwind > information by the linker. The proposal favors simple concatenation > while maintaining the same ordering of groups as their associated > code. > > The runtime library can build this index at runtime and cache it to disk. 
> This will include the dynamic linker and the stack tracer in the Linux kernel (the latter when stack tracing user space stacks). Do you think this is feasible ? > Once the design becomes sufficiently stable, we can introduce an > opt-in linker option --xxxxframe-index that builds an index from > recognized format versions while reporting warnings for unrecognized > ones.> We need to carefully design this mechanism to be stable and robust, > avoiding frequent format updates. > >> From >> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64: >> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch >> Table'') is created by the linker using information in the unwind >> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section >> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker >> may use the provided unwind descriptors directly or replace them with >> equivalent optimized forms based on its optimization strategies." >> >> Above all, do users want a solution which requires falling back on >> DWARF-based processing for precise stack tracing ? > > The key distinction is that compact unwind handles the vast majority > of functions without DWARF—the macOS measurements show __unwind_info > at 0.6% of __text size with __eh_frame reduced to negligible size > (0x58 bytes). While SFrame also cannot handle all frames, compact > unwind achieves dramatic size reductions by making DWARF the exception > rather than requiring it alongside a supplementary format. > As we have tried to reason, this is a misleading comparison. The compact unwind tables format: - needs to be extended for asynchronous stack unwinding - needs to be extended for other ABI/architectures - Making it concatenable / linker-friendly will also likely impose some negative effects on size. 
> The DWARF fallback provides flexibility for additional coverage when > needed, but nothing is lost (at least for the clang binary on macOS) > if DWARF fallback were disabled in a hypothetical future linux-perf > implementation. > Fair enough, thats something for linux-perf/kernel to decide. Once the OpenVMS RFC is sufficiently shaped to become a viable replacement for .eh_frame, this question will be for the stakeholders to decide. >>> **The AArch64 case: size matters even more** >>> >>> The size consideration becomes even more critical for AArch64, which is >>> heavily deployed on mobile phones. >>> There's an active feature request for compact unwind support in the >>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344 >>> This underscores the broader industry need for efficient unwind >>> information that doesn't duplicate data or significantly increase binary >>> size. >>> >> >> Our measurements with a dataset of about 1400 userspace artifacts >> (binaries and shared libraries) show that the SFrame/(EH Frame + EH >> Frame HDR) ratio is: >> - Average of 0.70 on AArch64. >> - Average of 1.00 on x86_64. >> >> Projecting the size of what you observe for clang binary on x86_64 to >> conclude the size ratio on AArch64 is not very wise to do. >> >> Whether the size impact is worth the benefit: its a choice for users to >> make. SFrame offers the users fast, precise stack traces with simple >> stack tracers. > > Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on > AArch64, this represents substantial memory overhead when considering: > > .eh_frame is already large and being complained about. > Being unable to eliminate it (needed for debugging and C++ exceptions) > and adding 0.70x more means significant additional overhead for users. > >>> There are at least two formats the ELF one can learn from: LLVM's >>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code. 
>>> >> >> Please, if you have any concrete suggestions (keeping the above goals in >> mind), you already know how/where to engage. > > I've provided concrete suggestions throughout this discussion. > Apologies, I should have been more precise. And I ask because you know the details about both SFrame and the variants of Compact Unwind Descriptor formats at this point :). If you have concrete suggestions to improve the SFrame format for size, please let us know. >>> **Path forward** >>> >>> Unless SFrame can actually replace .eh_frame (rather than supplementing >>> it as an accelerator for linux-perf) and demonstrate sizes smaller >>> than .eh_frame - matching the efficiency of existing compact unwind >>> approaches — I question its practical viability for userspace. >>> The current design appears to add overhead rather than reduce it. >>> This isn't to suggest we should simply adopt the existing compact unwind >>> format wholesale. >>> The x86-64 design dates back to 2009 or earlier, and there are likely >>> improvements we can make. However, we should aim for similar or better >>> efficiency gains. >>> >>> For additional context, I've documented my detailed analysis at: >>> >>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering >>> mandatory index building problems, section group compliance and garbage >>> collection issues, and version compatibility challenges) >> >> GC issue is a bug currently tracked and with a target milestone of 2.46. >> >>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade- >>> offs (size analysis) >>> ^ permalink raw reply [flat|nested] 30+ messages in thread
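The "VM increase" column in the size table quoted in this message is not defined in the thread. The figures are consistent with it being the growth in VM size relative to the clang-built frame-pointer binary of the same program. The minimal Python sketch below reproduces the clang rows under that assumption; the helper names are hypothetical and the inputs are copied from the table.

```python
def vm_increase_percent(vm_size: int, baseline_vm_size: int) -> float:
    """Growth of vm_size over the baseline, in percent, to one decimal."""
    return round((vm_size - baseline_vm_size) / baseline_vm_size * 100, 1)


def sframe_to_eh_ratio(sframe_size: int, eh_size: int) -> float:
    """Size of .sframe relative to the EH size column, to two decimals."""
    return round(sframe_size / eh_size, 2)


# VM size of /tmp/out/custom-fp-sync/bin/clang, taken from the table.
# Assumption: this is the baseline the "VM increase" column refers to.
CLANG_FP_BASELINE = 92638305

if __name__ == "__main__":
    for name, vm in [("sframe-sync", 105572021),
                     ("fp-gcc-sync", 116524565),
                     ("sframe-gcc-sync", 127839309)]:
        print(name, vm_increase_percent(vm, CLANG_FP_BASELINE))
    print("sframe/EH:", sframe_to_eh_ratio(8521760, 8684540))
```

Read this way, the SFrame builds carry a .sframe section roughly the size of the EH data on top of the EH data itself, which is the overhead under discussion.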
* Re: Concerns about SFrame viability for userspace stack walking
2025-11-06 0:44 ` Indu Bhagat
@ 2025-11-06 7:51 ` Florian Weimer
2025-11-06 21:02 ` Indu Bhagat
2025-11-06 9:20 ` Fangrui Song
1 sibling, 1 reply; 30+ messages in thread
From: Florian Weimer @ 2025-11-06 7:51 UTC (permalink / raw)
To: Indu Bhagat
Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

* Indu Bhagat:

> PLT stubs may use stack (push to stack). As per the document "A null
> frame (MODE = 8) is the simplest possible frame, with no allocated
> stack of either kind (hence no saved registers)". So null frame can
> be used for PLT only if the functions invoking the PLT stub were using
> an RBP-based frame. Isnt it ?

I think I said this before, but I don't think new toolchain features
need to support lazy binding. Without lazy bindings, the PLT stubs do
not change the stack pointer or frame pointer and just make a tail call.

Do you see a need for continued support of lazy binding?

Thanks,
Florian
* Re: Concerns about SFrame viability for userspace stack walking
2025-11-06 7:51 ` Florian Weimer
@ 2025-11-06 21:02 ` Indu Bhagat
0 siblings, 0 replies; 30+ messages in thread
From: Indu Bhagat @ 2025-11-06 21:02 UTC (permalink / raw)
To: Florian Weimer
Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On 11/5/25 11:51 PM, Florian Weimer wrote:
> * Indu Bhagat:
>
>> PLT stubs may use stack (push to stack). As per the document "A null
>> frame (MODE = 8) is the simplest possible frame, with no allocated
>> stack of either kind (hence no saved registers)". So null frame can
>> be used for PLT only if the functions invoking the PLT stub were using
>> an RBP-based frame. Isnt it ?
>
> I think I said this before, but I don't think new toolchain features
> need to support lazy binding. Without lazy bindings, the PLT stubs do
> not change the stack pointer or frame pointer and just make a tail call.
>
> Do you see a need for continued support of lazy binding?
>

(Yes, you did mention this before in another thread on Binutils.)

My thinking has been: some linker emulations default to lazy binding (I
guess the reason is that changing the default is difficult). So users
may end up continuing with lazy binding unknowingly?

But I guess not designing new toolchain features to support lazy
binding seems reasonable.

Thanks
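This sub-thread turns on what a stack walker must do when the PC is inside a PLT stub. Elsewhere in the thread the default rule is stated as: if a code region is not covered by metadata, assume the return address is available at *rsp on x86-64. The Python sketch below is a toy model of that rule only. The function names, the range list, and the dictionary standing in for stack memory are all hypothetical; this is not code from any real unwinder.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Sorted, non-overlapping [start, end) address ranges that have unwind metadata.
Ranges = List[Tuple[int, int]]


def has_metadata(ranges: Ranges, pc: int) -> bool:
    """Return True if pc falls inside a range described by unwind metadata."""
    starts = [start for start, _ in ranges]
    i = bisect_right(starts, pc) - 1
    return i >= 0 and ranges[i][0] <= pc < ranges[i][1]


def default_rule_return_address(ranges: Ranges, stack: Dict[int, int],
                                pc: int, rsp: int) -> Optional[int]:
    """Apply the x86-64 default rule for code without metadata.

    Returns None when metadata exists, meaning the caller should consult
    it instead. Otherwise the return address is read from the simulated
    stack slot that rsp points at.
    """
    if has_metadata(ranges, pc):
        return None
    return stack[rsp]


if __name__ == "__main__":
    ranges = [(0x1000, 0x1100), (0x2000, 0x2040)]
    stack = {0x7FFD0000: 0x1234}
    # 0x1800 models a non-lazy PLT stub: no metadata, rsp untouched.
    print(hex(default_rule_return_address(ranges, stack, 0x1800, 0x7FFD0000)))
```

The model holds only for stubs that leave the stack pointer unchanged, as described above for non-lazy binding. A lazy-binding stub that pushes to the stack would break the assumption.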
* Re: Concerns about SFrame viability for userspace stack walking 2025-11-06 0:44 ` Indu Bhagat 2025-11-06 7:51 ` Florian Weimer @ 2025-11-06 9:20 ` Fangrui Song 2025-11-06 20:42 ` Indu Bhagat 1 sibling, 1 reply; 30+ messages in thread From: Fangrui Song @ 2025-11-06 9:20 UTC (permalink / raw) To: Indu Bhagat Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@oracle.com> wrote: > > On 11/5/25 12:21 AM, Fangrui Song wrote: > >> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote: > >> On 2025-10-29 11:53 p.m., Fangrui Song wrote: > >>> I've been following the SFrame discussion and wanted to share some > >>> concerns about its viability for userspace adoption, based on concrete > >>> measurements and comparison with existing compact unwind implementations > >>> in LLVM. > >>> > >>> **Size overhead concerns** > >>> > >>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is > >>> approximately 10% larger than the combined size of .eh_frame > >>> and .eh_frame_hdr (8.06 MiB total). > >>> This is problematic because .eh_frame cannot be eliminated - it contains > >>> essential information for restoring callee-saved registers, LSDA, and > >>> personality information needed for debugging (e.g. reading local > >>> variables in a coredump) and C++ exception handling. > >>> > >>> This means adopting SFrame would result in carrying both formats, with a > >>> large net size increase. > >>> > >>> **Learning from existing compact unwind implementations** > >>> > >>> It's worth noting that LLVM has had a battle-tested compact unwind > >>> format in production use since 2009 with OS X 10.6, which transitioned > >>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic: > >>> > >>> __text section: 0x4a55470 bytes > >>> __unwind_info section: 0x79060 bytes (0.6% of __text) > >>> __eh_frame section: 0x58 bytes > >>> > >> > >> I believe this is only synchronous? 
If yes, do you think this is a fair > >> measurement to compare against ? > >> > >> Does the compact unwind info scheme work well for cases of > >> shrink-wrapping ? How about the case of AArch64, where the ABI does not > >> mandate if and where frame record is created ? > >> > >> For the numbers above, does it ensure precise stack traces ? > >> > >> From the The Apple Compact Unwinding Format document > >> (https://faultlore.com/blah/compact-unwinding/), > >> "One consequence of only having one opcode for a whole function is that > >> functions will generally have incorrect instructions for the function’s > >> prologue (where callee-saved registers are individually PUSHed onto the > >> stack before the rest of the stack space is allocated)." > >> > >> "Presumably this isn’t a very big deal, since there’s very few > >> situations where unwinding would involve a function still executing its > >> prologue/epilogue." > >> > >> Well, getting precise stack traces is a big deal and the users want them. > > > > **Shrink-wrapping and precise stack traces**: Yes, compact unwind > > handles these through an extension proposed by OpenVMS (not yet > > upstreamed to LLVM): > > https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html > > > > Thanks for the link. > > The above questions were strictly in the context of the battle-tested > "The Apple Compact Unwinding Format" in production in the lld/MachO > implementation, not for the proposed OpenVMS extensions. > > Is it possible to get answers to those questions with that context in place? > > If shrink-wrapping and precise stack traces isnt supported without the > OpenVMS extension (that is not yet implemented), arent we comparing > apples vs pears here ? You're right to ask for clarification. The extended compact unwind information works with shrink wrapping. For context, a FDE in .eh_frame costs at least 20 bytes (often 30+), plus its associated .eh_frame_hdr entry costs 8 bytes. 
Even a larger compact unwind descriptor at 8 bytes yields significant
savings compared to .eh_frame. Tripling that to 24 bytes is still a
substantial win.

Additionally, very few functions benefit from shrink wrapping
optimization. When needed, we require multiple unwind description
records (typically 3).

> > Technical details of the extension:
> >
> > - A single unwind group describes a (prologue_part1, prologue_part2,
> > body, epilogue) tuple.
> > - The prologue is conceptually split into two parts: the first part
> > extends up to and including the instruction that decreases RSP; the
> > second part extends to a point after the last preserved register is
> > saved but before any preserved register is modified (this location is
> > not unique, providing flexibility).
> >   + When unwinding in the prologue, the RSP register value can be
> > inferred from the PC and the set of saved registers.
> > - Since register restoration is idempotent (restoring preserved
> > registers multiple times during unwinding causes no harm), there is no
> > need to describe `pop $reg` sequences. The unwind group needs just one
> > bit to describe whether the 1-byte `ret` instruction is present.
>
> Is this true for the case of asynchronous stack tracing too ?

Yes. I believe it means the epilogue mirrors the prologue.
Since we know which registers were saved in the prologue, we can infer
the pop instructions in the epilogue and compute the SP offset when
unwinding in the middle of an epilogue.

> > - The `length` field in the compact unwind group descriptor is
> > repurposed to describe the prologue's two parts.
> > - By composing multiple unwind groups, potentially with zero-sized
> > prologues or omitting `ret` instructions in epilogues, it can describe
> > functions with shrink wrapping or tail duplication optimization.
> > - Null frame groups (with no prologue or epilogue) are the default and
> > can describe trampolines and PLT stubs.
>
> PLT stubs may use stack (push to stack). As per the document "A null
> frame (MODE = 8) is the simplest possible frame, with no allocated stack
> of either kind (hence no saved registers)". So null frame can be used
> for PLT only if the functions invoking the PLT stub were using an
> RBP-based frame. Isnt it ?
> BTW, but both EH Frame and SFrame have specific, condensed
> representation for metadata for PLT entries.

A profiler can trivially retrieve the return address using the default
rule: if a code region is not covered by metadata, assume the return
address is available at *rsp (x86-64) or in the link register (most
other architectures).

The ld-generated unwind info for PLT entries is largely obsolete
nowadays due to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW).
Such PLT entries behave as functions without a prologue, so the default
rule above is sufficient for them.

> >
>
> Anyway, thanks for the summary.
>
> I see that OpenVMS extension for asynchronous compact unwind descriptors
> is an RFC state ATM. But few important observations and questions:
>
> - As noted in the recently revived discussion,
> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471,
> there is going to be a *non-negligible* size overhead as soon as you
> move towards a specification for asynchronous (vs the current
> specification that caters to synchronous only). Now add to it, the
> quirks of each architecture/ABI :). Any comments ?

As mentioned, even a larger compact unwind descriptor at 8 bytes
yields significant savings compared to .eh_frame, and is also
substantially smaller than SFrame.

> - From the document: "Use of any preserved register must be delayed
> until all of the preserved registers have been saved."
> Q: Does this work well with optimizing compilers ? Is this an ABI
> change being asked for multiple architectures ?

I think this is about support for callee-saved registers, a feature
SFrame doesn't have.
I need to think about the details, but this thread is probably not the best place to discuss them. > - From the document: "It appears technically feasible for a null frame > function to have a personality routine. However, the utility of such a > capability seems too meager to justify allowing this. We propose to not > support this." and "If the first attempt to lookup an unwind group for > an exception address fails, then it is (tentatively) assumed to have > occurred within a null frame function or in a part of a function > that is adequately described by a null frame. The presumed return > address is (virtually or actually) popped from the top of stack and > looked up. This second attempted lookup must succeed, in which case > processing continues normally. A failure is a fatal error." > Q: Is this a problem, especially because the goal is to evolve the > OpenVMS RFC proposal is subsume .eh_frame ? I think this just hard-encodes the default rule, similar to what SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed offset from the CFA when entering a new function." While I haven't given this much thought yet, I don't think this introduces problems that SFrame doesn't have. > Are there people actively working towards bringing this to fruition? > > > Now, to compare this against SFrame's space efficiency for synchronous > > unwinding, I've built llvm-mc, opt, and clang with > > -fno-asynchronous-unwind-tables -funwind-tables across multiple build > > configurations (clang vs gcc, frame pointer vs sframe). 
The resulting
> > .sframe section sizes are significant:
> >
> > % cat ~/tmp/test-unwind.sh
> > #!/bin/zsh
> > conf() {
> >   configure-llvm $@ -DCMAKE_EXE_LINKER_FLAGS='-pie
> > -Wl,-z,pack-relative-relocs' -DLLVM_ENABLE_UNWIND_TABLES=on \
> >     -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off
> > }
> >
> > clang=-fno-integrated-as
> > gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc"
> > "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")
> >
> > fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer
> > -B$HOME/opt/binutils/bin -Wa,--gsframe=no
> > -fno-asynchronous-unwind-tables -funwind-tables"
> > sframe="-fomit-frame-pointer -momit-leaf-frame-pointer
> > -B$HOME/opt/binutils/bin -Wa,--gsframe -fno-asynchronous-unwind-tables
> > -funwind-tables"
> >
> > conf custom-fp-sync -DCMAKE_{C,CXX}_FLAGS="$clang $fp"
> > conf custom-sframe-sync -DCMAKE_{C,CXX}_FLAGS="$clang $sframe"
> > conf custom-fp-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
> > conf custom-sframe-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}
> >
> > for i in fp-sync sframe-sync fp-gcc-sync sframe-gcc-sync; do ninja -C
> > /tmp/out/custom-$i llvm-mc opt clang; done
> >
> > % ~/Dev/unwind-info-size-analyzer/section_size.rb
> > /tmp/out/custom-{fp,sframe}-{,gcc-}sync/bin/{llvm-mc,opt,clang}
> > Filename                                    | .text size       | EH size        | .sframe size   | VM size   | VM increase
> > --------------------------------------------+------------------+----------------+----------------+-----------+------------
> > /tmp/out/custom-fp-sync/bin/llvm-mc         | 2124031 (23.5%)  | 301136 (3.3%)  | 0 (0.0%)       | 9050149   | -
> > /tmp/out/custom-sframe-sync/bin/llvm-mc     | 2114383 (22.3%)  | 367452 (3.9%)  | 348235 (3.7%)  | 9483621   | +4.8%
> > /tmp/out/custom-fp-gcc-sync/bin/llvm-mc     | 2744214 (29.2%)  | 301836 (3.2%)  | 0 (0.0%)       | 9389677   | +3.8%
> > /tmp/out/custom-sframe-gcc-sync/bin/llvm-mc | 2705860 (27.7%)  | 354292 (3.6%)  | 356073 (3.6%)  | 9780985   | +8.1%
> > /tmp/out/custom-fp-sync/bin/opt             | 38873081 (69.9%) | 3538408 (6.4%) | 0 (0.0%)       | 55598521  | -
> > /tmp/out/custom-sframe-sync/bin/opt         | 39011423 (62.4%) | 4557012 (7.3%) | 4452908 (7.1%) | 62494765  | +12.4%
> > /tmp/out/custom-fp-gcc-sync/bin/opt         | 54654535 (78.1%) | 3631076 (5.2%) | 0 (0.0%)       | 70001573  | +25.9%
> > /tmp/out/custom-sframe-gcc-sync/bin/opt     | 53644831 (70.4%) | 4857220 (6.4%) | 5263530 (6.9%) | 76205733  | +37.1%
> > /tmp/out/custom-fp-sync/bin/clang           | 68345753 (73.8%) | 6643384 (7.2%) | 0 (0.0%)       | 92638305  | -
> > /tmp/out/custom-sframe-sync/bin/clang       | 68500319 (64.9%) | 8684540 (8.2%) | 8521760 (8.1%) | 105572021 | +14.0%
> > /tmp/out/custom-fp-gcc-sync/bin/clang       | 96515079 (82.8%) | 6556756 (5.6%) | 0 (0.0%)       | 116524565 | +25.8%
> > /tmp/out/custom-sframe-gcc-sync/bin/clang   | 94583903 (74.0%) | 8817628 (6.9%) | 9696263 (7.6%) | 127839309 | +38.0%
> >
> > Note: in GCC FP builds, .text is larger due to missing optimization
> > for RBP-based frames (e.g.
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108386). Once this
> > optimization is implemented, GCC FP builds should actually have
> > smaller .text than RSP-based builds, because RBP-relative addressing
> > produces more compact encodings than RSP-relative addressing (which
> > requires an extra SIB byte).
> >
> > .sframe for sync is not noticeably smaller than that for async. This
> > is probably because
> > there are still many DW_CFA_advance_loc ops even in
> > -fno-asynchronous-unwind-tables -funwind-tables builds.
> >
>
> Possible that its because in the Apple Compact Unwind Format, the linker
> optimizes compact unwind descriptors into the three-level paged
> structure, effectively de-duplicating some content.

Yes, the linker does perform deduplication and builds the paged index
structure. However, the fundamental compactness comes from the
encoding itself: each regular function is described with just 4 bytes
in the common encoding, compared to .sframe's much larger per-FDE
overhead.
The two-level lookup table optimization amplifies this advantage.

> >>> (On macOS you can check the section size with objdump --arch x86_64 -h clang
> >>> and dump the unwind info with objdump --arch x86_64 --unwind-info clang)
> >>>
> >>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
> >>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
> >>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> >>>
> >>> The compact unwind format achieves this efficiency through a two-level
> >>> page table structure. It describes common frame layouts compactly and
> >>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
> >>> to be eliminated while maintaining full functionality. For more details,
> >>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
> >>> implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
> >>>
> >>
> >> How does your vision of "linker-friendly" stack tracing/stack unwinding
> >> format reconcile with these suggested approaches ? As far as I can tell,
> >> these formats also require linker created indexes and are
> >> non-concatenable (custom handling in every linker). Something you've
> >> had "significant concerns" about.
> >>
>
> This question is unanswered: What do you think about
> "linker-friendliness" of the current implementation of the lld/MachO
> implementation of the compact unwind format in LLVM ?

The linker input and output use different section names, so a dumb
linker would work as long as the runtime accepts the concatenated
sections.

My vision for an ELF compact unwind format uses separate section names
for link-time vs. runtime representations. The compiler output format
should be concatenable, with linker index-building as an optional
optimization that improves performance but isn't mandatory for
correctness.
I'm going to add more details at
https://maskray.me/blog/2025-09-28-remarks-on-sframe

> >
> > We can distinguish between linking-time and execution-time
> > representations by using different section names.
> > The OpenVMS specification says:
> >
> > It is useful to note that the run-time representation of unwind
> > information can vary from little more than a simple concatenation of
> > the compile-time information to a substantial rewriting of unwind
> > information by the linker. The proposal favors simple concatenation
> > while maintaining the same ordering of groups as their associated
> > code.
> >
> > The runtime library can build this index at runtime and cache it to disk.
> >
>
> This will include the dynamic linker and the stack tracer in the Linux
> kernel (the latter when stack tracing user space stacks). Do you think
> this is feasible ?
>
> > Once the design becomes sufficiently stable, we can introduce an
> > opt-in linker option --xxxxframe-index that builds an index from
> > recognized format versions while reporting warnings for unrecognized
> > ones.> We need to carefully design this mechanism to be stable and robust,
> > avoiding frequent format updates.
> >
> >> From
> >> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
> >> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
> >> Table'') is created by the linker using information in the unwind
> >> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
> >> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
> >> may use the provided unwind descriptors directly or replace them with
> >> equivalent optimized forms based on its optimization strategies."
> >>
> >> Above all, do users want a solution which requires falling back on
> >> DWARF-based processing for precise stack tracing ?
> > > > The key distinction is that compact unwind handles the vast majority > > of functions without DWARF—the macOS measurements show __unwind_info > > at 0.6% of __text size with __eh_frame reduced to negligible size > > (0x58 bytes). While SFrame also cannot handle all frames, compact > > unwind achieves dramatic size reductions by making DWARF the exception > > rather than requiring it alongside a supplementary format. > > > > As we have tried to reason, this is a misleading comparison. The compact > unwind tables format: > - needs to be extended for asynchronous stack unwinding > - needs to be extended for other ABI/architectures > - Making it concatenable / linker-friendly will also likely impose > some negative effects on size. The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS proposal demonstrates that supporting asynchronous unwinding is straightforward. Making it linker-friendly does not impose negative effects on the output section size. > > The DWARF fallback provides flexibility for additional coverage when > > needed, but nothing is lost (at least for the clang binary on macOS) > > if DWARF fallback were disabled in a hypothetical future linux-perf > > implementation. > > > > Fair enough, thats something for linux-perf/kernel to decide. Once the > OpenVMS RFC is sufficiently shaped to become a viable replacement for > .eh_frame, this question will be for the stakeholders to decide. Agreed. My concern is that .sframe is being deployed before we've fully explored whether a more compact and efficient alternative is achievable. > >>> **The AArch64 case: size matters even more** > >>> > >>> The size consideration becomes even more critical for AArch64, which is > >>> heavily deployed on mobile phones. 
> >>> There's an active feature request for compact unwind support in the > >>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344 > >>> This underscores the broader industry need for efficient unwind > >>> information that doesn't duplicate data or significantly increase binary > >>> size. > >>> > >> > >> Our measurements with a dataset of about 1400 userspace artifacts > >> (binaries and shared libraries) show that the SFrame/(EH Frame + EH > >> Frame HDR) ratio is: > >> - Average of 0.70 on AArch64. > >> - Average of 1.00 on x86_64. > >> > >> Projecting the size of what you observe for clang binary on x86_64 to > >> conclude the size ratio on AArch64 is not very wise to do. > >> > >> Whether the size impact is worth the benefit: its a choice for users to > >> make. SFrame offers the users fast, precise stack traces with simple > >> stack tracers. > > > > Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on > > AArch64, this represents substantial memory overhead when considering: > > > > .eh_frame is already large and being complained about. > > Being unable to eliminate it (needed for debugging and C++ exceptions) > > and adding 0.70x more means significant additional overhead for users. > > > >>> There are at least two formats the ELF one can learn from: LLVM's > >>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code. > >>> > >> > >> Please, if you have any concrete suggestions (keeping the above goals in > >> mind), you already know how/where to engage. > > > > I've provided concrete suggestions throughout this discussion. > > > > Apologies, I should have been more precise. And I ask because you know > the details about both SFrame and the variants of Compact Unwind > Descriptor formats at this point :). If you have concrete suggestions to > improve the SFrame format for size, please let us know. At this point, I'm not certain about specific modifications to .sframe itself. 
I think we should start from scratch, drawing ideas from compact
unwind information and Windows ARM64.

The existing compact unwind information uses the following 4-byte
descriptor (fields listed from the least significant bit up):

  uint32_t mode_specific_encoding : 24; // vary with different modes
  uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK
  uint32_t personality_index : 2;
  uint32_t has_lsda : 1;
  uint32_t is_not_function_start : 1;

We probably need a less restricted version that accounts for different
architecture needs. The result would still be significantly smaller
than SFrame v2 and the future v3 (unless it's completely rewritten).

We should probably design an optional two-level lookup table mechanism
for additional savings (at the cost of linker friendliness).

> >>> **Path forward**
> >>>
> >>> Unless SFrame can actually replace .eh_frame (rather than supplementing
> >>> it as an accelerator for linux-perf) and demonstrate sizes smaller
> >>> than .eh_frame - matching the efficiency of existing compact unwind
> >>> approaches — I question its practical viability for userspace.
> >>> The current design appears to add overhead rather than reduce it.
> >>> This isn't to suggest we should simply adopt the existing compact unwind
> >>> format wholesale.
> >>> The x86-64 design dates back to 2009 or earlier, and there are likely
> >>> improvements we can make. However, we should aim for similar or better
> >>> efficiency gains.
> >>>
> >>> For additional context, I've documented my detailed analysis at:
> >>>
> >>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> >>> mandatory index building problems, section group compliance and garbage
> >>> collection issues, and version compatibility challenges)
> >>
> >> GC issue is a bug currently tracked and with a target milestone of 2.46.
> >>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> >>> (size analysis)
> >>>

The GC issue would not have happened at all if we had used multiple
sections and thought about ELF and linker conventions :)
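As a reading aid for the 4-byte descriptor listed in the message above, here is a minimal Python sketch that splits a 32-bit encoding into its common fields. The mask values are assumed to match those in LLVM's `compact_unwind_encoding.h` (`UNWIND_X86_64_MODE_MASK`, `UNWIND_PERSONALITY_MASK`, `UNWIND_HAS_LSDA`, `UNWIND_IS_NOT_FUNCTION_START`) and should be checked against that header. The x86-64 mode names and the helper names `Descriptor` and `decode` are illustrative only and are not part of any proposed ELF format.

```python
from typing import NamedTuple

# Assumed mask values; verify against compact_unwind_encoding.h.
MODE_SPECIFIC_MASK = 0x00FFFFFF
MODE_MASK = 0x0F000000
PERSONALITY_MASK = 0x30000000
HAS_LSDA = 0x40000000
IS_NOT_FUNCTION_START = 0x80000000

# Illustrative names for the x86-64 mode values.
X86_64_MODES = {
    0: "none",
    1: "rbp-frame",
    2: "stack-immediate",
    3: "stack-indirect",
    4: "dwarf",  # fall back to an .eh_frame FDE
}


class Descriptor(NamedTuple):
    mode: str
    mode_specific: int
    personality_index: int
    has_lsda: bool
    is_not_function_start: bool


def decode(encoding: int) -> Descriptor:
    """Split a 32-bit compact unwind encoding into its common fields."""
    mode_value = (encoding & MODE_MASK) >> 24
    return Descriptor(
        mode=X86_64_MODES.get(mode_value, f"unknown({mode_value})"),
        mode_specific=encoding & MODE_SPECIFIC_MASK,
        personality_index=(encoding & PERSONALITY_MASK) >> 28,
        has_lsda=bool(encoding & HAS_LSDA),
        is_not_function_start=bool(encoding & IS_NOT_FUNCTION_START),
    )


if __name__ == "__main__":
    print(decode(0x01000000))
    print(decode(0x92000010))
```

Everything a walker needs for a regular function fits in those 32 bits, which is why the per-function cost discussed in this thread is 4 bytes before any index or deduplication is applied.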
* Re: Concerns about SFrame viability for userspace stack walking 2025-11-06 9:20 ` Fangrui Song @ 2025-11-06 20:42 ` Indu Bhagat 2025-11-09 0:23 ` Fangrui Song 0 siblings, 1 reply; 30+ messages in thread From: Indu Bhagat @ 2025-11-06 20:42 UTC (permalink / raw) To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel On 11/6/25 1:20 AM, Fangrui Song wrote: > On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@oracle.com> wrote: >> >> On 11/5/25 12:21 AM, Fangrui Song wrote: >>>> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote: >>>> On 2025-10-29 11:53 p.m., Fangrui Song wrote: >>>>> I've been following the SFrame discussion and wanted to share some >>>>> concerns about its viability for userspace adoption, based on concrete >>>>> measurements and comparison with existing compact unwind implementations >>>>> in LLVM. >>>>> >>>>> **Size overhead concerns** >>>>> >>>>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is >>>>> approximately 10% larger than the combined size of .eh_frame >>>>> and .eh_frame_hdr (8.06 MiB total). >>>>> This is problematic because .eh_frame cannot be eliminated - it contains >>>>> essential information for restoring callee-saved registers, LSDA, and >>>>> personality information needed for debugging (e.g. reading local >>>>> variables in a coredump) and C++ exception handling. >>>>> >>>>> This means adopting SFrame would result in carrying both formats, with a >>>>> large net size increase. >>>>> >>>>> **Learning from existing compact unwind implementations** >>>>> >>>>> It's worth noting that LLVM has had a battle-tested compact unwind >>>>> format in production use since 2009 with OS X 10.6, which transitioned >>>>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic: >>>>> >>>>> __text section: 0x4a55470 bytes >>>>> __unwind_info section: 0x79060 bytes (0.6% of __text) >>>>> __eh_frame section: 0x58 bytes >>>>> >>>> >>>> I believe this is only synchronous? 
If yes, do you think this is a fair >>>> measurement to compare against ? >>>> >>>> Does the compact unwind info scheme work well for cases of >>>> shrink-wrapping ? How about the case of AArch64, where the ABI does not >>>> mandate if and where frame record is created ? >>>> >>>> For the numbers above, does it ensure precise stack traces ? >>>> >>>> From the The Apple Compact Unwinding Format document >>>> (https://faultlore.com/blah/compact-unwinding/), >>>> "One consequence of only having one opcode for a whole function is that >>>> functions will generally have incorrect instructions for the function’s >>>> prologue (where callee-saved registers are individually PUSHed onto the >>>> stack before the rest of the stack space is allocated)." >>>> >>>> "Presumably this isn’t a very big deal, since there’s very few >>>> situations where unwinding would involve a function still executing its >>>> prologue/epilogue." >>>> >>>> Well, getting precise stack traces is a big deal and the users want them. >>> >>> **Shrink-wrapping and precise stack traces**: Yes, compact unwind >>> handles these through an extension proposed by OpenVMS (not yet >>> upstreamed to LLVM): >>> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html >>> >> >> Thanks for the link. >> >> The above questions were strictly in the context of the battle-tested >> "The Apple Compact Unwinding Format" in production in the lld/MachO >> implementation, not for the proposed OpenVMS extensions. >> >> Is it possible to get answers to those questions with that context in place? >> >> If shrink-wrapping and precise stack traces isnt supported without the >> OpenVMS extension (that is not yet implemented), arent we comparing >> apples vs pears here ? > > You're right to ask for clarification. > The extended compact unwind information works with shrink wrapping. > Sorry, again, not asking about the "extended". 
If I may: So, this is a convoluted way of saying that the current implementation of the Apple Compact Unwind Info (lld/MachO, which was used to get the data) does not support shrink wrapping. The documentation of the format I am referring to is https://faultlore.com/blah/compact-unwinding/.

That said, the point I have been driving at:

The Apple Compact Unwind format (https://faultlore.com/blah/compact-unwinding/) does not support shrink wrapping, nor is it meant for asynchronous stack walking. Comparing that data to what SFrame gives is comparing apples to pears. Misleading.

(The reason I asked the question to begin with is that I wasn't sure whether the documentation is out of date.)

> For context, a FDE in .eh_frame costs at least 20 bytes (often 30+),
> plus its associated .eh_frame_hdr entry costs 8 bytes.
> Even a larger compact unwind descriptor at 8 bytes yields significant
> savings compared to .eh_frame. Tripling that to 24 bytes is still a
> substantial win.
>
> Additionally, very few functions benefit from shrink wrapping
> optimization. When needed, we require multiple unwind description
> records (typically 3).
>
>>> Technical details of the extension:
>>>
>>> - A single unwind group describes a (prologue_part1, prologue_part2,
>>> body, epilogue) tuple.
>>> - The prologue is conceptually split into two parts: the first part
>>> extends up to and including the instruction that decreases RSP; the
>>> second part extends to a point after the last preserved register is
>>> saved but before any preserved register is modified (this location is
>>> not unique, providing flexibility).
>>> + When unwinding in the prologue, the RSP register value can be
>>> inferred from the PC and the set of saved registers.
>>> - Since register restoration is idempotent (restoring preserved
>>> registers multiple times during unwinding causes no harm), there is no
>>> need to describe `pop $reg` sequences.
The unwind group needs just one >>> bit to describe whether the 1-byte `ret` instruction is present. >> >> Is this true for the case of asynchronous stack tracing too ? > > Yes. I believe it means the epilogue mirrors the prologue. Since we > know which registers were saved in the prologue, we can infer the pop > instructions in the epilogue and compute the SP offset when unwinding > in the middle of an epilogue. > This is not asynchronous then. This meddles with the core business of an optimizing compiler which may want to organize epilogue/prologue differently. >>> - The `length` field in the compact unwind group descriptor is >>> repurposed to describe the prologue's two parts. >>> - By composing multiple unwind groups, potentially with zero-sized >>> prologues or omitting `ret` instructions in epilogues, it can describe >>> functions with shrink wrapping or tail duplication optimization. >>> - Null frame groups (with no prologue or epilogue) are the default and >>> can describe trampolines and PLT stubs. >> >> PLT stubs may use stack (push to stack). As per the document "A null >> frame (MODE = 8) is the simplest possible frame, with no allocated stack >> of either kind (hence no saved registers)". So null frame can be used >> for PLT only if the functions invoking the PLT stub were using an >> RBP-based frame. Isnt it ? >> BTW, but both EH Frame and SFrame have specific, condensed >> representation for metadata for PLT entries. > > A profiler can trivially retrieve the return address using the default > rule: if a code region is not covered by metadata, assume the return > address is available at *rsp (x86-64) or in the link register (most > other architectures). > > This ld-generated unwind info feature is largely obsolete nowadays due > to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW). PLT entries > behave as functions without a prologue, so a profiler can trivially > retrieve the return address using the default unwinding rule. 
> >>> >> >> Anyway, thanks for the summary. >> >> I see that OpenVMS extension for asynchronous compact unwind descriptors >> is an RFC state ATM. But few important observations and questions: >> >> - As noted in the recently revived discussion, >> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471, >> there is going to be a *non-negligible* size overhead as soon as you >> move towards a specification for asynchronous (vs the current >> specification that caters to synchronous only). Now add to it, the >> quirks of each architecture/ABI :). Any comments ? > > As mentioned, even a larger compact unwind descriptor at 8 bytes > yields significant savings compared to .eh_frame, and is also > substantially smaller than SFrame. > >> - From the document: "Use of any preserved register must be delayed >> until all of the preserved registers have been saved." >> Q: Does this work well with optimizing compilers ? Is this an ABI >> change being asked for multiple architectures ? > > I think this is about support for callee-saved registers, a feature > SFrame doesn't have. > SFrame doesn't have it, because it doesnt need to carry this information for stack tracing. OpenVMS RFC effort, OTOH, is about subsuming .eh_frame and be _the_ stack tracing/stack unwinding format. The latter *has to* work this out. > I need to think about the details, but this thread is probably not the > best place to discuss them. > Absolutely, I agree, not the best place or time to pin down the details of an RFC at all. But cannot let an unfair argument just fly by. The point I am driving to with these questions around the OpenVMS asynchronous info RFC: - 'OpenVMS extensions for asynchronous stack unwinding' in an RFC which still needs work. - It remains to be seen how this proposal manages the fine line of space-efficiency while trying to be the goto format for asynchronous stack unwinding together with fast, precise and low-overhead stack tracing. 
- SFrame is for stack tracing only. Subsuming .eh_frame is not in the plans. >> - From the document: "It appears technically feasible for a null frame >> function to have a personality routine. However, the utility of such a >> capability seems too meager to justify allowing this. We propose to not >> support this." and "If the first attempt to lookup an unwind group for >> an exception address fails, then it is (tentatively) assumed to have >> occurred within a null frame function or in a part of a function >> that is adequately described by a null frame. The presumed return >> address is (virtually or actually) popped from the top of stack and >> looked up. This second attempted lookup must succeed, in which case >> processing continues normally. A failure is a fatal error." >> Q: Is this a problem, especially because the goal is to evolve the >> OpenVMS RFC proposal is subsume .eh_frame ? > > I think this just hard-encodes the default rule, similar to what > SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed > offset from the CFA when entering a new function." > > While I haven't given this much thought yet, I don't think this > introduces problems that SFrame doesn't have. > Correction: Not true. This is configurable in SFrame. s390x needs RA tracking (not fixed offset) and is supported in SFrame. >> Are there people actively working towards bringing this to fruition? >> >>> Now, to compare this against SFrame's space efficiency for synchronous >>> unwinding, I've built llvm-mc, opt, and clang with >>> -fno-asynchronous-unwind-tables -funwind-tables across multiple build >>> configurations (clang vs gcc, frame pointer vs sframe). >>> [snip]>>> >>> .sframe for sync is not noticeably smaller than that for async. This >>> is probably because >>> there are still many DW_CFA_advance_loc ops even in >>> -fno-asynchronous-unwind-tables -funwind-tables builds. 
>>> >> >> Possible that its because in the Apple Compact Unwind Format, the linker >> optimizes compact unwind descriptors into the three-level paged >> structure, effectively de-duplicating some content. > > Yes, the linker does perform deduplication and builds the paged index > structure. However, the fundamental compactness comes from the > encoding itself: each regular function is described with just 4 bytes > in the common encoding, compared to .sframe's much larger per-FDE > overhead. > The two-level lookup table optimization amplifies this advantage. > >>>>> (On macOS you can check the section size with objdump --arch x86_64 - >>>>> h clang and dump the unwind info with objdump --arch x86_64 --unwind- >>>>> info clang) >>>>> >>>>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as >>>>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post: >>>>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282 >>>>> >>>>> The compact unwind format achieves this efficiency through a two-level >>>>> page table structure. It describes common frame layouts compactly and >>>>> falls back to DWARF only when necessary, allowing most DWARF CFI entries >>>>> to be eliminated while maintaining full functionality. For more details, >>>>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO >>>>> implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/ >>>>> UnwindInfoSection.cpp >>>>> >>>> >>>> How does your vision of "linker-friendly" stack tracing/stack unwinding >>>> format reconcile with these suggested approaches ? As far as I can tell, >>>> these formats also require linker created indexes and are >>>> non-concatenable (custom handling in every linker). Something you've >>>> had "significant concerns" about. 
>>>> >> >> This question is unanswered: What do you think about >> "linker-friendliness" of the current implementation of the lld/MachO >> implementation of the compact unwind format in LLVM ? > > The linker input and output use different section names, so a dumb > linker would work as long as the runtime accepts the concatenated > sections. > > My vision for an ELF compact unwind format uses separate section names > for link-time vs. runtime representations. The compiler output format > should be concatenable, with linker index-building as an optional > optimization that improves performance but isn't mandatory for > correctness. > > I'll going to add more details > https://maskray.me/blog/2025-09-28-remarks-on-sframe > > >>> >>> We can distinguish between linking-time and execution-time >>> representations by using different section names. >>> The OpenVMS specification says: >>> >>> It is useful to note that the run-time representation of unwind >>> information can vary from little more than a simple concatenation of >>> the compile-time information to a substantial rewriting of unwind >>> information by the linker. The proposal favors simple concatenation >>> while maintaining the same ordering of groups as their associated >>> code. >>> >>> The runtime library can build this index at runtime and cache it to disk. >>> >> >> This will include the dynamic linker and the stack tracer in the Linux >> kernel (the latter when stack tracing user space stacks). Do you think >> this is feasible ? >> >>> Once the design becomes sufficiently stable, we can introduce an >>> opt-in linker option --xxxxframe-index that builds an index from >>> recognized format versions while reporting warnings for unrecognized >>> ones.> We need to carefully design this mechanism to be stable and robust, >>> avoiding frequent format updates. 
>>>>> From >>>> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64: >>>> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch >>>> Table'') is created by the linker using information in the unwind >>>> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section >>>> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker >>>> may use the provided unwind descriptors directly or replace them with >>>> equivalent optimized forms based on its optimization strategies." >>>> >>>> Above all, do users want a solution which requires falling back on >>>> DWARF-based processing for precise stack tracing ? >>> >>> The key distinction is that compact unwind handles the vast majority >>> of functions without DWARF—the macOS measurements show __unwind_info >>> at 0.6% of __text size with __eh_frame reduced to negligible size >>> (0x58 bytes). While SFrame also cannot handle all frames, compact >>> unwind achieves dramatic size reductions by making DWARF the exception >>> rather than requiring it alongside a supplementary format. >>> >> >> As we have tried to reason, this is a misleading comparison. The compact >> unwind tables format: >> - needs to be extended for asynchronous stack unwinding >> - needs to be extended for other ABI/architectures >> - Making it concatenable / linker-friendly will also likely impose >> some negative effects on size. > > The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS > proposal demonstrates that supporting asynchronous unwinding is > straightforward. > > Making it linker-friendly does not impose negative effects on the > output section size. > OK, well, I agree to disagree :) Looking forward to some movement on the OpenVMS asynchronous unwind RFC to see resolution to some of the issues, and some data to back that claim. 
>>> The DWARF fallback provides flexibility for additional coverage when >>> needed, but nothing is lost (at least for the clang binary on macOS) >>> if DWARF fallback were disabled in a hypothetical future linux-perf >>> implementation. >>> >> >> Fair enough, thats something for linux-perf/kernel to decide. Once the >> OpenVMS RFC is sufficiently shaped to become a viable replacement for >> .eh_frame, this question will be for the stakeholders to decide. > > Agreed. My concern is that .sframe is being deployed before we've > fully explored whether a more compact and efficient alternative is > achievable. > > >>>>> **The AArch64 case: size matters even more** >>>>> >>>>> The size consideration becomes even more critical for AArch64, which is >>>>> heavily deployed on mobile phones. >>>>> There's an active feature request for compact unwind support in the >>>>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344 >>>>> This underscores the broader industry need for efficient unwind >>>>> information that doesn't duplicate data or significantly increase binary >>>>> size. >>>>> >>>> >>>> Our measurements with a dataset of about 1400 userspace artifacts >>>> (binaries and shared libraries) show that the SFrame/(EH Frame + EH >>>> Frame HDR) ratio is: >>>> - Average of 0.70 on AArch64. >>>> - Average of 1.00 on x86_64. >>>> >>>> Projecting the size of what you observe for clang binary on x86_64 to >>>> conclude the size ratio on AArch64 is not very wise to do. >>>> >>>> Whether the size impact is worth the benefit: its a choice for users to >>>> make. SFrame offers the users fast, precise stack traces with simple >>>> stack tracers. >>> >>> Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on >>> AArch64, this represents substantial memory overhead when considering: >>> >>> .eh_frame is already large and being complained about. 
>>> Being unable to eliminate it (needed for debugging and C++ exceptions) >>> and adding 0.70x more means significant additional overhead for users. >>> >>>>> There are at least two formats the ELF one can learn from: LLVM's >>>>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code. >>>>> >>>> >>>> Please, if you have any concrete suggestions (keeping the above goals in >>>> mind), you already know how/where to engage. >>> >>> I've provided concrete suggestions throughout this discussion. >>> >> >> Apologies, I should have been more precise. And I ask because you know >> the details about both SFrame and the variants of Compact Unwind >> Descriptor formats at this point :). If you have concrete suggestions to >> improve the SFrame format for size, please let us know. > > At this point, I'm not certain about specific modifications to .sframe > itself. I think we should start from scratch, drawing ideas from > compact unwind information and Windows ARM64. > > The existing compact unwind information uses the following 4-byte descriptor: > > uint32_t mode_specific_encoding : 24; // vary with different modes > > uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK > > uint32_t has_lsda : 1; > uint32_t personality_index : 2; > uint32_t is_not_function_start : 1; > Thanks. SFrame is not for stack unwinding. Subsuming .eh_frame is topic for another day. SFrame does not intend to go that route. > We probably need a less-restricted version and account for different > architecture needs. The result would still be significantly smaller > than SFrame v2 and the future v3 (unless it's completely rewritten). > > We should probably design an optional two-level lookup table mechanism > for additional savings (at the cost of linker friendliness). 
>
>>>>> **Path forward**
>>>>>
>>>>> Unless SFrame can actually replace .eh_frame (rather than supplementing
>>>>> it as an accelerator for linux-perf) and demonstrate sizes smaller
>>>>> than .eh_frame - matching the efficiency of existing compact unwind
>>>>> approaches - I question its practical viability for userspace.
>>>>> The current design appears to add overhead rather than reduce it.
>>>>> This isn't to suggest we should simply adopt the existing compact unwind
>>>>> format wholesale.
>>>>> The x86-64 design dates back to 2009 or earlier, and there are likely
>>>>> improvements we can make. However, we should aim for similar or better
>>>>> efficiency gains.
>>>>>
>>>>> For additional context, I've documented my detailed analysis at:
>>>>>
>>>>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
>>>>> mandatory index building problems, section group compliance and garbage
>>>>> collection issues, and version compatibility challenges)
>>>>
>>>> GC issue is a bug currently tracked and with a target milestone of 2.46.
>>>>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs (size analysis)
>>>>>
>
> The GC issue would not have happened at all if we had used multiple
> sections and thought about ELF and linker convention :)

Thanks for engaging.
* Re: Concerns about SFrame viability for userspace stack walking 2025-11-06 20:42 ` Indu Bhagat @ 2025-11-09 0:23 ` Fangrui Song 0 siblings, 0 replies; 30+ messages in thread From: Fangrui Song @ 2025-11-09 0:23 UTC (permalink / raw) To: Indu Bhagat Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel On Thu, Nov 6, 2025 at 12:42 PM Indu Bhagat <indu.bhagat@oracle.com> wrote: > > On 11/6/25 1:20 AM, Fangrui Song wrote: > > On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@oracle.com> wrote: > >> > >> On 11/5/25 12:21 AM, Fangrui Song wrote: > >>>> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote: > >>>> On 2025-10-29 11:53 p.m., Fangrui Song wrote: > >>>>> I've been following the SFrame discussion and wanted to share some > >>>>> concerns about its viability for userspace adoption, based on concrete > >>>>> measurements and comparison with existing compact unwind implementations > >>>>> in LLVM. > >>>>> > >>>>> **Size overhead concerns** > >>>>> > >>>>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is > >>>>> approximately 10% larger than the combined size of .eh_frame > >>>>> and .eh_frame_hdr (8.06 MiB total). > >>>>> This is problematic because .eh_frame cannot be eliminated - it contains > >>>>> essential information for restoring callee-saved registers, LSDA, and > >>>>> personality information needed for debugging (e.g. reading local > >>>>> variables in a coredump) and C++ exception handling. > >>>>> > >>>>> This means adopting SFrame would result in carrying both formats, with a > >>>>> large net size increase. > >>>>> > >>>>> **Learning from existing compact unwind implementations** > >>>>> > >>>>> It's worth noting that LLVM has had a battle-tested compact unwind > >>>>> format in production use since 2009 with OS X 10.6, which transitioned > >>>>> to using CFI directives in 2013 [1]. 
The efficiency gains are dramatic: > >>>>> > >>>>> __text section: 0x4a55470 bytes > >>>>> __unwind_info section: 0x79060 bytes (0.6% of __text) > >>>>> __eh_frame section: 0x58 bytes > >>>>> > >>>> > >>>> I believe this is only synchronous? If yes, do you think this is a fair > >>>> measurement to compare against ? > >>>> > >>>> Does the compact unwind info scheme work well for cases of > >>>> shrink-wrapping ? How about the case of AArch64, where the ABI does not > >>>> mandate if and where frame record is created ? > >>>> > >>>> For the numbers above, does it ensure precise stack traces ? > >>>> > >>>> From the The Apple Compact Unwinding Format document > >>>> (https://faultlore.com/blah/compact-unwinding/), > >>>> "One consequence of only having one opcode for a whole function is that > >>>> functions will generally have incorrect instructions for the function’s > >>>> prologue (where callee-saved registers are individually PUSHed onto the > >>>> stack before the rest of the stack space is allocated)." > >>>> > >>>> "Presumably this isn’t a very big deal, since there’s very few > >>>> situations where unwinding would involve a function still executing its > >>>> prologue/epilogue." > >>>> > >>>> Well, getting precise stack traces is a big deal and the users want them. > >>> > >>> **Shrink-wrapping and precise stack traces**: Yes, compact unwind > >>> handles these through an extension proposed by OpenVMS (not yet > >>> upstreamed to LLVM): > >>> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html > >>> > >> > >> Thanks for the link. > >> > >> The above questions were strictly in the context of the battle-tested > >> "The Apple Compact Unwinding Format" in production in the lld/MachO > >> implementation, not for the proposed OpenVMS extensions. > >> > >> Is it possible to get answers to those questions with that context in place? 
> >>
> >> If shrink-wrapping and precise stack traces isnt supported without the
> >> OpenVMS extension (that is not yet implemented), arent we comparing
> >> apples vs pears here ?
> >
> > You're right to ask for clarification.
> > The extended compact unwind information works with shrink wrapping.
> >
>
> Sorry, again, not asking about the "extended".
>
> If I may: So, this is a convoluted way of saying the current
> implementation of the Apple Compact Unwind Info (lld/MachO, which was
> used to get the data) does not support shrink wrapping. The
> documentation of the format I am refering to
> (https://faultlore.com/blah/compact-unwinding/).
>
> That said, the point I have been driving to:
>
> The Apple Compact Unwind format
> (https://faultlore.com/blah/compact-unwinding/) does not support shrink
> wrapping and neither is for asynchronous stack walking. Comparing that
> data to what SFrame gives is comparing apples to pears. Misleading.
>
> (The reason I asked the question to begin with is because I wasn't sure
> if the documentation is out of date).

The original compact unwind information implementation was designed in 2009, before shrink wrapping was implemented in LLVM in 2015. It is definitely not fully asynchronous, as it lacks information about the epilogue. When unwinding in the middle of the prologue, one can recover partial information by leveraging the prologue codegen pattern, which is probably good enough to recover SP in the absence of shrink wrapping.

While there are limitations, that does not mean we cannot get useful data from it.

In an x86-64 build of clang-21, there is a single CIE and 141845 FDEs. The average size of an FDE is (0x733348 - 0x18) / 141845 ~= 52.225 (0x18 is the first FDE offset in llvm-dwarfdump -eh-frame output). Counting the .eh_frame_hdr entry, the per-function size is around 52.225+8 = 60.225. The .sframe V2 per-function size is 0x820820 / 141845 ~= 60.078.
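
As a sanity check, the per-function arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and the inputs are the section sizes and FDE count exactly as quoted. With those inputs the average FDE size comes out at about 53.2 bytes rather than 52.2, so one of the quoted numbers may contain a typo; the observation that .eh_frame plus .eh_frame_hdr and .sframe V2 each cost roughly 60 bytes per function is unaffected.

```python
# Inputs copied from the message above (x86-64 clang-21 build).
# All names here are illustrative, not part of any tool's API.
EH_FRAME_SIZE = 0x733348     # .eh_frame size in bytes, as quoted
FIRST_FDE_OFFSET = 0x18      # offset of the first FDE (the single CIE precedes it)
SFRAME_SIZE = 0x820820       # .sframe V2 size in bytes, as quoted
NUM_FDES = 141845            # FDE count, as quoted
EH_FRAME_HDR_ENTRY = 8       # bytes per .eh_frame_hdr table entry, as quoted


def per_function_sizes():
    """Return (average FDE size, FDE + .eh_frame_hdr entry, .sframe per function)."""
    avg_fde = (EH_FRAME_SIZE - FIRST_FDE_OFFSET) / NUM_FDES
    eh_total = avg_fde + EH_FRAME_HDR_ENTRY
    sframe = SFRAME_SIZE / NUM_FDES
    return avg_fde, eh_total, sframe


if __name__ == "__main__":
    avg_fde, eh_total, sframe = per_function_sizes()
    print(f"average FDE size:          {avg_fde:.3f} bytes")
    print(f"FDE + .eh_frame_hdr entry: {eh_total:.3f} bytes")
    print(f".sframe V2 per function:   {sframe:.3f} bytes")
    print(f"both formats together:     {eh_total + sframe:.1f} bytes")
```

Dividing by the FDE count treats every FDE as one function, which is the same simplification the message makes.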
On LLVM Discourse we are discussing the next generation of compact unwind information, which will support at least asynchronous stack tracing (the SFrame feature subset) and synchronous C++ exceptions. We aim to provide a per-entry size of 12 bytes. The average number of entries per function is likely between 1 and 2, making the scheme very size-efficient even without utilizing page table deduplication.

> > For context, a FDE in .eh_frame costs at least 20 bytes (often 30+),
> > plus its associated .eh_frame_hdr entry costs 8 bytes.
> > Even a larger compact unwind descriptor at 8 bytes yields significant
> > savings compared to .eh_frame. Tripling that to 24 bytes is still a
> > substantial win.
> >
> > Additionally, very few functions benefit from shrink wrapping
> > optimization. When needed, we require multiple unwind description
> > records (typically 3).
> >
> >>> Technical details of the extension:
> >>>
> >>> - A single unwind group describes a (prologue_part1, prologue_part2,
> >>> body, epilogue) tuple.
> >>> - The prologue is conceptually split into two parts: the first part
> >>> extends up to and including the instruction that decreases RSP; the
> >>> second part extends to a point after the last preserved register is
> >>> saved but before any preserved register is modified (this location is
> >>> not unique, providing flexibility).
> >>> + When unwinding in the prologue, the RSP register value can be
> >>> inferred from the PC and the set of saved registers.
> >>> - Since register restoration is idempotent (restoring preserved
> >>> registers multiple times during unwinding causes no harm), there is no
> >>> need to describe `pop $reg` sequences. The unwind group needs just one
> >>> bit to describe whether the 1-byte `ret` instruction is present.
> >>
> >> Is this true for the case of asynchronous stack tracing too ?
> >
> > Yes. I believe it means the epilogue mirrors the prologue.
Since we > > know which registers were saved in the prologue, we can infer the pop > > instructions in the epilogue and compute the SP offset when unwinding > > in the middle of an epilogue. > > > > This is not asynchronous then. > This meddles with the core business of an optimizing compiler which may > want to organize epilogue/prologue differently. Asynchronous as far as the compiler-generated patterns are concerned. Compilers do exhibit the patterns and we should utilize them, aiming for a compact format. We are trying to lift the restriction as much as possible when designing the new format. > >>> - The `length` field in the compact unwind group descriptor is > >>> repurposed to describe the prologue's two parts. > >>> - By composing multiple unwind groups, potentially with zero-sized > >>> prologues or omitting `ret` instructions in epilogues, it can describe > >>> functions with shrink wrapping or tail duplication optimization. > >>> - Null frame groups (with no prologue or epilogue) are the default and > >>> can describe trampolines and PLT stubs. > >> > >> PLT stubs may use stack (push to stack). As per the document "A null > >> frame (MODE = 8) is the simplest possible frame, with no allocated stack > >> of either kind (hence no saved registers)". So null frame can be used > >> for PLT only if the functions invoking the PLT stub were using an > >> RBP-based frame. Isnt it ? > >> BTW, but both EH Frame and SFrame have specific, condensed > >> representation for metadata for PLT entries. > > > > A profiler can trivially retrieve the return address using the default > > rule: if a code region is not covered by metadata, assume the return > > address is available at *rsp (x86-64) or in the link register (most > > other architectures). > > > > This ld-generated unwind info feature is largely obsolete nowadays due > > to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW). 
PLT entries
> > behave as functions without a prologue, so a profiler can trivially
> > retrieve the return address using the default unwinding rule.
> >
> >>>
> >>
> >> Anyway, thanks for the summary.
> >>
> >> I see that OpenVMS extension for asynchronous compact unwind descriptors
> >> is an RFC state ATM. But few important observations and questions:
> >>
> >> - As noted in the recently revived discussion,
> >> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471,
> >> there is going to be a *non-negligible* size overhead as soon as you
> >> move towards a specification for asynchronous (vs the current
> >> specification that caters to synchronous only). Now add to it, the
> >> quirks of each architecture/ABI :). Any comments ?
> >
> > As mentioned, even a larger compact unwind descriptor at 8 bytes
> > yields significant savings compared to .eh_frame, and is also
> > substantially smaller than SFrame.
> >
> >> - From the document: "Use of any preserved register must be delayed
> >> until all of the preserved registers have been saved."
> >> Q: Does this work well with optimizing compilers ? Is this an ABI
> >> change being asked for multiple architectures ?
> >
> > I think this is about support for callee-saved registers, a feature
> > SFrame doesn't have.
> >
>
> SFrame doesn't have it, because it doesnt need to carry this information
> for stack tracing. OpenVMS RFC effort, OTOH, is about subsuming
> .eh_frame and be _the_ stack tracing/stack unwinding format. The latter
> *has to* work this out.

This stance puts SFrame in a very narrow niche. Per-function unwind info of 120 bytes (EH+SFrame: 60+60) far exceeds the size the next-generation compact unwind information aims to achieve (likely <24 bytes even without using a page table).

I believe the potential of the next-generation compact unwind information is clear. For this reason, I urge performance maintainers not to rush the integration of sframe v3 support. If these architectural design issues of SFrame aren't resolved beforehand, we risk launching a format that very few people will actually use.

> > I need to think about the details, but this thread is probably not the
> > best place to discuss them.
> >
>
> Absolutely, I agree, not the best place or time to pin down the details
> of an RFC at all. But cannot let an unfair argument just fly by.
>
> The point I am driving to with these questions around the OpenVMS
> asynchronous info RFC:
> - 'OpenVMS extensions for asynchronous stack unwinding' in an RFC which
> still needs work.
> - It remains to be seen how this proposal manages the fine line of
> space-efficiency while trying to be the goto format for asynchronous
> stack unwinding together with fast, precise and low-overhead stack tracing.
> - SFrame is for stack tracing only. Subsuming .eh_frame is not in the
> plans.
>
> >> - From the document: "It appears technically feasible for a null frame
> >> function to have a personality routine. However, the utility of such a
> >> capability seems too meager to justify allowing this. We propose to not
> >> support this." and "If the first attempt to lookup an unwind group for
> >> an exception address fails, then it is (tentatively) assumed to have
> >> occurred within a null frame function or in a part of a function
> >> that is adequately described by a null frame. The presumed return
> >> address is (virtually or actually) popped from the top of stack and
> >> looked up. This second attempted lookup must succeed, in which case
> >> processing continues normally. A failure is a fatal error."
> >> Q: Is this a problem, especially because the goal is to evolve the
> >> OpenVMS RFC proposal is subsume .eh_frame ?
> >
> > I think this just hard-encodes the default rule, similar to what
> > SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed
> > offset from the CFA when entering a new function."
> >
> > While I haven't given this much thought yet, I don't think this
> > introduces problems that SFrame doesn't have.
> >
>
> Correction: Not true. This is configurable in SFrame. s390x needs RA
> tracking (not fixed offset) and is supported in SFrame.

A hypothetical s390x implementation of the compact unwind information can reserve 1 bit (in the mode-specific-encoding, or "opcodes" in https://faultlore.com/blah/compact-unwinding/ ) to indicate whether the RA is saved in a stack slot or a register.

> >> Are there people actively working towards bringing this to fruition?
> >>
> >>> Now, to compare this against SFrame's space efficiency for synchronous
> >>> unwinding, I've built llvm-mc, opt, and clang with
> >>> -fno-asynchronous-unwind-tables -funwind-tables across multiple build
> >>> configurations (clang vs gcc, frame pointer vs sframe).
> >>> [snip]>>>
> >>> .sframe for sync is not noticeably smaller than that for async. This
> >>> is probably because
> >>> there are still many DW_CFA_advance_loc ops even in
> >>> -fno-asynchronous-unwind-tables -funwind-tables builds.
> >>>
> >>
> >> Possible that its because in the Apple Compact Unwind Format, the linker
> >> optimizes compact unwind descriptors into the three-level paged
> >> structure, effectively de-duplicating some content.
> >
> > Yes, the linker does perform deduplication and builds the paged index
> > structure. However, the fundamental compactness comes from the
> > encoding itself: each regular function is described with just 4 bytes
> > in the common encoding, compared to .sframe's much larger per-FDE
> > overhead.
> > The two-level lookup table optimization amplifies this advantage.
> >
> >>>>> (On macOS you can check the section size with objdump --arch x86_64 -
> >>>>> h clang and dump the unwind info with objdump --arch x86_64 --unwind-
> >>>>> info clang)
> >>>>>
> >>>>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
> >>>>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
> >>>>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> >>>>>
> >>>>> The compact unwind format achieves this efficiency through a two-level
> >>>>> page table structure. It describes common frame layouts compactly and
> >>>>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
> >>>>> to be eliminated while maintaining full functionality. For more details,
> >>>>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
> >>>>> implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/
> >>>>> UnwindInfoSection.cpp
> >>>>>
> >>>>
> >>>> How does your vision of "linker-friendly" stack tracing/stack unwinding
> >>>> format reconcile with these suggested approaches ? As far as I can tell,
> >>>> these formats also require linker created indexes and are
> >>>> non-concatenable (custom handling in every linker). Something you've
> >>>> had "significant concerns" about.
> >>>>
> >>
> >> This question is unanswered: What do you think about
> >> "linker-friendliness" of the current implementation of the lld/MachO
> >> implementation of the compact unwind format in LLVM ?
> >
> > The linker input and output use different section names, so a dumb
> > linker would work as long as the runtime accepts the concatenated
> > sections.
> >
> > My vision for an ELF compact unwind format uses separate section names
> > for link-time vs. runtime representations. The compiler output format
> > should be concatenable, with linker index-building as an optional
> > optimization that improves performance but isn't mandatory for
> > correctness.
> >
> > I'll going to add more details
> > https://maskray.me/blog/2025-09-28-remarks-on-sframe
> >
> >
> >>>
> >>> We can distinguish between linking-time and execution-time
> >>> representations by using different section names.
> >>> The OpenVMS specification says:
> >>>
> >>> It is useful to note that the run-time representation of unwind
> >>> information can vary from little more than a simple concatenation of
> >>> the compile-time information to a substantial rewriting of unwind
> >>> information by the linker. The proposal favors simple concatenation
> >>> while maintaining the same ordering of groups as their associated
> >>> code.
> >>>
> >>> The runtime library can build this index at runtime and cache it to disk.
> >>>
> >>
> >> This will include the dynamic linker and the stack tracer in the Linux
> >> kernel (the latter when stack tracing user space stacks). Do you think
> >> this is feasible ?
> >>
> >>> Once the design becomes sufficiently stable, we can introduce an
> >>> opt-in linker option --xxxxframe-index that builds an index from
> >>> recognized format versions while reporting warnings for unrecognized
> >>> ones.> We need to carefully design this mechanism to be stable and robust,
> >>> avoiding frequent format updates.
> >>>>> From
> >>>> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
> >>>> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
> >>>> Table'') is created by the linker using information in the unwind
> >>>> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
> >>>> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
> >>>> may use the provided unwind descriptors directly or replace them with
> >>>> equivalent optimized forms based on its optimization strategies."
> >>>>
> >>>> Above all, do users want a solution which requires falling back on
> >>>> DWARF-based processing for precise stack tracing ?
> >>>
> >>> The key distinction is that compact unwind handles the vast majority
> >>> of functions without DWARF—the macOS measurements show __unwind_info
> >>> at 0.6% of __text size with __eh_frame reduced to negligible size
> >>> (0x58 bytes). While SFrame also cannot handle all frames, compact
> >>> unwind achieves dramatic size reductions by making DWARF the exception
> >>> rather than requiring it alongside a supplementary format.
> >>>
> >>
> >> As we have tried to reason, this is a misleading comparison. The compact
> >> unwind tables format:
> >> - needs to be extended for asynchronous stack unwinding
> >> - needs to be extended for other ABI/architectures
> >> - Making it concatenable / linker-friendly will also likely impose
> >> some negative effects on size.
> >
> > The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS
> > proposal demonstrates that supporting asynchronous unwinding is
> > straightforward.
> >
> > Making it linker-friendly does not impose negative effects on the
> > output section size.
> >
>
> OK, well, I agree to disagree :)
>
> Looking forward to some movement on the OpenVMS asynchronous unwind RFC
> to see resolution to some of the issues, and some data to back that claim.
>
> >>> The DWARF fallback provides flexibility for additional coverage when
> >>> needed, but nothing is lost (at least for the clang binary on macOS)
> >>> if DWARF fallback were disabled in a hypothetical future linux-perf
> >>> implementation.
> >>>
> >>
> >> Fair enough, thats something for linux-perf/kernel to decide. Once the
> >> OpenVMS RFC is sufficiently shaped to become a viable replacement for
> >> .eh_frame, this question will be for the stakeholders to decide.
> >
> > Agreed. My concern is that .sframe is being deployed before we've
> > fully explored whether a more compact and efficient alternative is
> > achievable.
> >
> >
> >>>>> **The AArch64 case: size matters even more**
> >>>>>
> >>>>> The size consideration becomes even more critical for AArch64, which is
> >>>>> heavily deployed on mobile phones.
> >>>>> There's an active feature request for compact unwind support in the
> >>>>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
> >>>>> This underscores the broader industry need for efficient unwind
> >>>>> information that doesn't duplicate data or significantly increase binary
> >>>>> size.
> >>>>>
> >>>>
> >>>> Our measurements with a dataset of about 1400 userspace artifacts
> >>>> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
> >>>> Frame HDR) ratio is:
> >>>> - Average of 0.70 on AArch64.
> >>>> - Average of 1.00 on x86_64.
> >>>>
> >>>> Projecting the size of what you observe for clang binary on x86_64 to
> >>>> conclude the size ratio on AArch64 is not very wise to do.
> >>>>
> >>>> Whether the size impact is worth the benefit: its a choice for users to
> >>>> make. SFrame offers the users fast, precise stack traces with simple
> >>>> stack tracers.
> >>>
> >>> Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on
> >>> AArch64, this represents substantial memory overhead when considering:
> >>>
> >>> .eh_frame is already large and being complained about.
> >>> Being unable to eliminate it (needed for debugging and C++ exceptions)
> >>> and adding 0.70x more means significant additional overhead for users.
> >>>
> >>>>> There are at least two formats the ELF one can learn from: LLVM's
> >>>>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
> >>>>>
> >>>>
> >>>> Please, if you have any concrete suggestions (keeping the above goals in
> >>>> mind), you already know how/where to engage.
> >>>
> >>> I've provided concrete suggestions throughout this discussion.
> >>>
> >>
> >> Apologies, I should have been more precise.
> >> And I ask because you know
> >> the details about both SFrame and the variants of Compact Unwind
> >> Descriptor formats at this point :). If you have concrete suggestions to
> >> improve the SFrame format for size, please let us know.
> >
> > At this point, I'm not certain about specific modifications to .sframe
> > itself. I think we should start from scratch, drawing ideas from
> > compact unwind information and Windows ARM64.
> >
> > The existing compact unwind information uses the following 4-byte descriptor:
> >
> >   uint32_t mode_specific_encoding : 24; // vary with different modes
> >
> >   uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK
> >
> >   uint32_t has_lsda : 1;
> >   uint32_t personality_index : 2;
> >   uint32_t is_not_function_start : 1;
> >
>
> Thanks.
>
> SFrame is not for stack unwinding. Subsuming .eh_frame is topic for
> another day. SFrame does not intend to go that route.
>
> > We probably need a less-restricted version and account for different
> > architecture needs. The result would still be significantly smaller
> > than SFrame v2 and the future v3 (unless it's completely rewritten).
> >
> > We should probably design an optional two-level lookup table mechanism
> > for additional savings (at the cost of linker friendliness).
> >
> >>>>> **Path forward**
> >>>>>
> >>>>> Unless SFrame can actually replace .eh_frame (rather than supplementing
> >>>>> it as an accelerator for linux-perf) and demonstrate sizes smaller
> >>>>> than .eh_frame - matching the efficiency of existing compact unwind
> >>>>> approaches — I question its practical viability for userspace.
> >>>>> The current design appears to add overhead rather than reduce it.
> >>>>> This isn't to suggest we should simply adopt the existing compact unwind
> >>>>> format wholesale.
> >>>>> The x86-64 design dates back to 2009 or earlier, and there are likely
> >>>>> improvements we can make. However, we should aim for similar or better
> >>>>> efficiency gains.
> >>>>>
> >>>>> For additional context, I've documented my detailed analysis at:
> >>>>>
> >>>>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> >>>>> mandatory index building problems, section group compliance and garbage
> >>>>> collection issues, and version compatibility challenges)
> >>>>
> >>>> GC issue is a bug currently tracked and with a target milestone of 2.46.
> >>>>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-
> >>>>> offs (size analysis)
> >>>>>
> >
> > The GC issue would not have happened at all if we had used multiple
> > sections and thought about ELF and linker convention :)
>
> Thanks for engaging.
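
For readers following the 4-byte descriptor quoted above, here is a minimal Python sketch of how its common fields can be pulled apart. The bit positions are taken from the masks in LLVM's `compact_unwind_encoding.h` (`UNWIND_IS_NOT_FUNCTION_START`, `UNWIND_HAS_LSDA`, `UNWIND_PERSONALITY_MASK`, `UNWIND_X86_64_MODE_MASK`). The function name `decode_descriptor` and the returned dictionary layout are illustrative assumptions, not part of any existing tool, and only the x86-64 mode values are listed.

```python
# Masks as defined in LLVM's compact_unwind_encoding.h.
UNWIND_IS_NOT_FUNCTION_START = 0x80000000
UNWIND_HAS_LSDA = 0x40000000
UNWIND_PERSONALITY_MASK = 0x30000000
UNWIND_X86_64_MODE_MASK = 0x0F000000
MODE_SPECIFIC_MASK = 0x00FFFFFF  # low 24 bits, meaning depends on the mode

# x86-64 mode values (UNWIND_X86_64_MODE_*), shifted down by 24 bits.
X86_64_MODES = {1: "RBP_FRAME", 2: "STACK_IMMD", 3: "STACK_IND", 4: "DWARF"}


def decode_descriptor(encoding: int) -> dict:
    """Split a 32-bit compact unwind encoding into its common fields.

    Hypothetical helper for illustration; it does not interpret the
    mode-specific 24 bits.
    """
    mode = (encoding & UNWIND_X86_64_MODE_MASK) >> 24
    return {
        "is_not_function_start": bool(encoding & UNWIND_IS_NOT_FUNCTION_START),
        "has_lsda": bool(encoding & UNWIND_HAS_LSDA),
        "personality_index": (encoding & UNWIND_PERSONALITY_MASK) >> 28,
        "mode": X86_64_MODES.get(mode, f"unknown({mode})"),
        "mode_specific_encoding": encoding & MODE_SPECIFIC_MASK,
    }


if __name__ == "__main__":
    # An RBP-based frame with an LSDA and no personality index.
    print(decode_descriptor(0x41000000))
    # A DWARF escape: the low 24 bits carry the mode-specific payload.
    print(decode_descriptor(0x04000123))
```

A descriptor whose mode decodes to `DWARF` is the escape case discussed in this thread: the unwinder has to consult `.eh_frame` for that function instead of the compact encoding.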
* Re: Concerns about SFrame viability for userspace stack walking
2025-10-30 6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
` (3 preceding siblings ...)
2025-11-04 9:21 ` Indu
@ 2025-12-01 9:04 ` Fangrui Song
4 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-12-01 9:04 UTC (permalink / raw)
To: linux-toolchains, linux-perf-users, linux-kernel

On 2025-10-29, Fangrui Song wrote:
>I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
>
>**Size overhead concerns**
>
>Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
>This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.
>
>This means adopting SFrame would result in carrying both formats, with a large net size increase.
>
>**Learning from existing compact unwind implementations**
>
>It's worth noting that LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6, which transitioned to using CFI directives in 2013 [1].
>The efficiency gains are dramatic:
>
>  __text section: 0x4a55470 bytes
>  __unwind_info section: 0x79060 bytes (0.6% of __text)
>  __eh_frame section: 0x58 bytes
>
>  (On macOS you can check the section size with objdump --arch x86_64 -h clang and dump the unwind info with objdump --arch x86_64 --unwind-info clang)
>
>OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their 2018 post: https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
>
>The compact unwind format achieves this efficiency through a two-level page table structure. It describes common frame layouts compactly and falls back to DWARF only when necessary, allowing most DWARF CFI entries to be eliminated while maintaining full functionality. For more details, see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
>
>**The AArch64 case: size matters even more**
>
>The size consideration becomes even more critical for AArch64, which is heavily deployed on mobile phones.
>There's an active feature request for compact unwind support in the AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
>This underscores the broader industry need for efficient unwind information that doesn't duplicate data or significantly increase binary size.
>
>There are at least two formats the ELF one can learn from: LLVM's compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
>
>**Path forward**
>
>Unless SFrame can actually replace .eh_frame (rather than supplementing it as an accelerator for linux-perf) and demonstrate sizes smaller than .eh_frame - matching the efficiency of existing compact unwind approaches — I question its practical viability for userspace.
>The current design appears to add overhead rather than reduce it.
>This isn't to suggest we should simply adopt the existing compact unwind format wholesale.
>The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.
>
>For additional context, I've documented my detailed analysis at:
>
>- https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering mandatory index building problems, section group compliance and garbage collection issues, and version compatibility challenges)
>- https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs (size analysis)
>
>Best regards,
>Fangrui
>
>[1]: https://github.com/llvm/llvm-project/commit/58e2d3d856b7dc7b97a18cfa2aeeb927bc7e6bd5 ("Generate compact unwind encoding from CFI directives.")
>

tl;dr I believe a compact unwind scheme demonstrates significant promise over SFrame. The MIPS compact exception tables, as implemented in Binutils, are also worth considering (the structure can be shared among all architectures, while the unwind codes have to be arch-specific).

I've ported the Mach-O compact unwind format to ELF in a branch, establishing a baseline for improvements to the compact unwind format.
```
% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe,compact,fp-gcc,sframe-gcc}/bin/{llvm-mc,opt}
Filename                               |       .text size |        EH size |   .sframe size |  VM size | VM increase
---------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp/bin/llvm-mc         |  2120895 (23.5%) |  301528 (3.3%) |       0 (0.0%) |  9043221 | -
/tmp/out/custom-sframe/bin/llvm-mc     |  2109231 (22.3%) |  367424 (3.9%) |  348041 (3.7%) |  9474085 | +4.8%
/tmp/out/custom-compact/bin/llvm-mc    |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 | -4.5%
/tmp/out/custom-fp-gcc/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |  9389677 | +3.8%
/tmp/out/custom-sframe-gcc/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |  9780985 | +8.2%
/tmp/out/custom-fp/bin/opt             | 38769545 (69.9%) | 3547688 (6.4%) |       0 (0.0%) | 55425217 | -
/tmp/out/custom-sframe/bin/opt         | 38891295 (62.4%) | 4559644 (7.3%) | 4448874 (7.1%) | 62292133 | +12.4%
/tmp/out/custom-compact/bin/opt        | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 | -6.1%
/tmp/out/custom-fp-gcc/bin/opt         | 54654215 (78.1%) | 3631196 (5.2%) |       0 (0.0%) | 70001373 | +26.3%
/tmp/out/custom-sframe-gcc/bin/opt     | 53644895 (70.4%) | 4857364 (6.4%) | 5263676 (6.9%) | 76206149 | +37.5%
```

**Evaluation results**

With the current implementation, 4937 out of 77648 FDEs (6.36%) require a DWARF escape, while the remaining FDEs can be replaced with unwind descriptors, yielding a huge size saving.

.eh_frame_hdr will become significantly smaller if we implement a two-level page table structure similar to Mach-O __unwind_info to deduplicate entries.
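
As a quick sanity check on the figures above, the following Python sketch recomputes the "VM increase" column for the clang builds and the share of FDEs that need a DWARF escape. All numbers are copied from the table and the paragraph above. The dictionary layout and the `vm_increase` helper are illustrative assumptions and have nothing to do with how `section_size.rb` itself computes the column.

```python
# VM sizes in bytes, copied from the section_size.rb table above (clang builds only).
vm_size = {
    "llvm-mc": {"fp": 9043221, "sframe": 9474085, "compact": 8639637},
    "opt": {"fp": 55425217, "sframe": 62292133, "compact": 52020449},
}


def vm_increase(binary: str, config: str) -> float:
    """Percentage change in VM size of `config` relative to the frame-pointer build."""
    base = vm_size[binary]["fp"]
    return (vm_size[binary][config] - base) / base * 100


# Share of FDEs that still require a DWARF escape (4937 of 77648).
dwarf_escape_share = 4937 / 77648 * 100

if __name__ == "__main__":
    for binary in vm_size:
        for config in ("sframe", "compact"):
            print(f"{binary:8} {config:8} {vm_increase(binary, config):+.1f}%")
    print(f"DWARF escape share: {dwarf_escape_share:.2f}%")
```

The frame-pointer build is used as the baseline because that is the row marked "-" in the table.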
**Build configurations**

```
#!/bin/zsh
conf() {
  configure-llvm $1 -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=bfd -pie -Wl,-z,pack-relative-relocs' \
    -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off ${@:2}
}

clang=(-DCMAKE_CXX_COMPILER=/tmp/Rel/bin/clang++ -DCMAKE_C_COMPILER=/tmp/Rel/bin/clang)
gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")

compact="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -mllvm -elf-compact-unwind -mllvm -x86-epilog-cfi=0"
fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no"
sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe"

conf custom-compact -DCMAKE_{C,CXX}_FLAGS="$compact" ${clang[@]} \
  -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=lld -pie -Wl,-z,pack-relative-relocs' \
  -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=lld
conf custom-fp -DCMAKE_{C,CXX}_FLAGS="-fno-integrated-as $fp" ${clang[@]}
conf custom-sframe -DCMAKE_{C,CXX}_FLAGS="-fno-integrated-as $sframe" ${clang[@]}
conf custom-fp-gcc -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
conf custom-sframe-gcc -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}

for i in compact fp sframe fp-gcc sframe-gcc; do ninja -C /tmp/out/custom-$i llvm-mc opt; done
```

The `/tmp/out/custom-compact` build uses my llvm-project branch (<http://github.com/MaskRay/llvm-project/tree/demo-unwind>) that ports Mach-O compact unwind to ELF, allowing the majority of `.eh_frame` FDEs to replace CFI instructions with unwind descriptors.

-mllvm -x86-epilog-cfi=0: Disables epilogue CFI for x86 (primarily implemented by D42848 in 2018, notably disabled for Darwin and Windows). Without this option most frames will not utilize unwind descriptors because the current Mach-O compact unwind implementation does not support popq %rbp; .cfi_def_cfa %rsp, 8; ret.
I believe this is still fair as we expect to use an 8-byte descriptor, sufficient to describe epilogue CFI.

If you still think custom-compact using -x86-epilog-cfi is not entirely fair to other builds, this is the table using -fno-asynchronous-unwind-tables -funwind-tables -mllvm -x86-epilog-cfi=0 for all builds:

% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{fp-sync,sframe-sync,compact-sync}/bin/{llvm-mc,opt}
Filename                                 |       .text size |        EH size |   .sframe size |  VM size | VM increase
-----------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp-sync/bin/llvm-mc      |  2120895 (24.1%) |  263396 (3.0%) |       0 (0.0%) |  8802093 | -
/tmp/out/custom-sframe-sync/bin/llvm-mc  |  2109231 (23.2%) |  291084 (3.2%) |  248654 (2.7%) |  9090325 | +3.3%
/tmp/out/custom-compact-sync/bin/llvm-mc |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 | -1.8%
/tmp/out/custom-fp-sync/bin/opt          | 38769545 (72.2%) | 2997572 (5.6%) |       0 (0.0%) | 53706041 | -
/tmp/out/custom-sframe-sync/bin/opt      | 38891295 (66.9%) | 3425116 (5.9%) | 2951292 (5.1%) | 58091421 | +8.2%
/tmp/out/custom-compact-sync/bin/opt     | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 | -3.1%

---

After I had implemented this, I then investigated the MIPS compact exception tables. I can now finalize the ‘in construction’ chapter of my blog post, https://maskray.me/blog/2020-11-08-stack-unwinding#mips-compact-exception-tables

Designed around 2015, it is actually a very good format.

Compiler output. The directive .cfi_sections .eh_frame_entry instructs the assembler to emit index table entries to the .eh_frame_entry section. .cfi_fde_data opcode1, ... between a pair of .cfi_startproc and .cfi_endproc describes the frame unwind opcodes, where each opcode takes one byte. The frame unwind opcodes describe the semantics of prologue instructions, similar to Windows ARM64 Frame Unwind Codes.

Assembler processing.
The assembler generates a .eh_frame_entry.* section for each section with compact unwind information. Each .eh_frame_entry is a pair of 4-byte words, where the first word is like the first word in a .eh_frame_hdr entry. An .eh_frame_entry entry takes one of three forms:

- Inline compact: (even pc, unwind_data). This form can be used when there are at most 3 opcodes (3 bytes) and no personality routine.
- Out-of-line compact: (odd pc, even unwind_ptr) where unwind_ptr points to unwind data in the .gnu_extab section.
- Legacy: (odd pc, odd legacy_unwind_ptr) where legacy_unwind_ptr points to the legacy .eh_frame section.

TODO: Describe .cfi_inline_lsda, which appears related to __gnu_compact_pr[1-3].

Linker processing. GNU ld concatenates .eh_frame_entry and .eh_frame_entry.* sections, sorting them by address. The following internal linker script fragment adds a header before the entries:

    .eh_frame_hdr : { *(.eh_frame_hdr) *(.eh_frame_entry .eh_frame_entry.*) }

Although the section name remains the traditional .eh_frame_hdr, the version is set to 2. The linker also defines the symbol __GNU_EH_FRAME_HDR to hold the .eh_frame_hdr address.

---

I've studied numerous stack unwinding/walking formats. DWARF CFI is essential to achieve near 100% coverage. Other formats, such as compact unwind formats and SFrame, have limitations. The ideal future solution, as an alternative to frame pointer chains, will be a stack unwinding format that supports C++ exceptions and can use DWARF CFI as a fallback.
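
The three .eh_frame_entry forms described above are distinguished purely by the low bit of each 4-byte word, which can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the two words are treated as opaque unsigned 32-bit values, the low bit is assumed to be a tag that is masked off to recover the value, and PC-relative resolution and relocation handling are left out. The function name `classify_entry` and the returned tuple shape are hypothetical.

```python
import struct


def classify_entry(raw: bytes, little_endian: bool = True) -> tuple[str, int, int]:
    """Classify one 8-byte .eh_frame_entry record by the low bits of its words.

    Returns (form, pc_word_without_tag, second_word). Hypothetical helper:
    it does not resolve PC-relative values or follow the pointers.
    """
    if len(raw) != 8:
        raise ValueError("an .eh_frame_entry record is a pair of 4-byte words")
    pc_word, data_word = struct.unpack("<II" if little_endian else ">II", raw)
    pc = pc_word & ~1  # assumption: bit 0 of the first word is only a tag
    if pc_word & 1 == 0:
        # Even pc: the second word holds the unwind data inline.
        return ("inline-compact", pc, data_word)
    if data_word & 1 == 0:
        # Odd pc, even pointer: unwind data lives in .gnu_extab.
        return ("out-of-line-compact", pc, data_word)
    # Odd pc, odd pointer: fall back to the legacy .eh_frame section.
    return ("legacy", pc, data_word & ~1)


if __name__ == "__main__":
    for words in [(0x1000, 0x01020300), (0x2001, 0x4000), (0x3001, 0x5001)]:
        print(classify_entry(struct.pack("<II", *words)))
```

The ordering of the checks mirrors the list above: the inline form is tested first because it needs no pointer chase, which is what makes it the cheap common case.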
end of thread, other threads:[~2025-12-01 9:03 UTC | newest]

Thread overview: 30+ messages
2025-10-30 6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
2025-10-30 7:30 ` Jakub Jelinek
2025-10-30 7:50 ` Fangrui Song
2025-10-30 8:05 ` Jakub Jelinek
2025-10-31 2:51 ` Fangrui Song
2025-10-30 10:26 ` Peter Zijlstra
2025-10-30 16:48 ` Fangrui Song
2025-10-30 17:03 ` Jose E. Marchesi
2025-10-31 4:22 ` Fangrui Song
2025-10-31 14:37 ` Jose E. Marchesi
2025-10-30 17:33 ` Steven Rostedt
2025-10-31 5:28 ` Fangrui Song
2025-10-30 18:22 ` Peter Zijlstra
2025-10-30 17:53 ` Andi Kleen
2025-10-30 18:07 ` Mark Brown
2025-10-30 18:31 ` Andi Kleen
2025-10-30 18:45 ` Mark Brown
2025-10-31 8:24 ` Fangrui Song
2025-10-30 18:57 ` Peter Zijlstra
2025-10-31 11:46 ` Mark Brown
2025-10-30 14:47 ` Jose E. Marchesi
2025-11-04 9:21 ` Indu
2025-11-05 8:21 ` Fangrui Song
2025-11-06 0:44 ` Indu Bhagat
2025-11-06 7:51 ` Florian Weimer
2025-11-06 21:02 ` Indu Bhagat
2025-11-06 9:20 ` Fangrui Song
2025-11-06 20:42 ` Indu Bhagat
2025-11-09 0:23 ` Fangrui Song
2025-12-01 9:04 ` Fangrui Song