Unaligned access trade-offs for SFrame FRE layout

linux-toolchains.vger.kernel.org archive mirror

* Unaligned access trade-offs for SFrame FRE layout
@ 2025-09-12 17:34 Indu Bhagat
  2025-09-12 18:19 ` Segher Boessenkool
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Indu Bhagat @ 2025-09-12 17:34 UTC (permalink / raw)
  To: Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Steven Rostedt, Serhei Makarov, Binutils
  Cc: linux-toolchains@vger.kernel.org

TL;DR: Thinking and experimenting a bit on the possible approaches for 
avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
am not convinced that avoiding unaligned accesses for performance is 
worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
not a good trade off for SFrame.

Problem Statement
On architectures such as x86_64, AArch64, and s390x, unaligned memory 
accesses are handled transparently by the hardware but incur a 
performance penalty. The objective of this analysis is to evaluate if 
these unaligned accesses can be eliminated from the SFrame FRE layout 
and if doing so provides a net performance benefit.

The central challenge is that any alternative must demonstrate a clear
performance improvement while avoiding significant size overhead. 
Introducing "bloat" to the format to solve a potential performance issue 
is a poor trade-off.

Source of unaligned accesses in SFrame FRE
  - (#1) Access to the SFrame FRE start address (sfre_start_address)
  - (#2) Access to the SFrame FRE stack offsets.  This is the varlen data 
trailing the SFrame FRE top-level members (sfre_start_address and FRE info), 
usually interpreted as stack offsets.

(Note that in the SFrame specification, SFrame Header, and SFrame FDE 
(function descriptor entry) have aligned accesses.)

Updated notes on the various approaches and respective evaluation notes 
on the wiki page:
https://sourceware.org/binutils/wiki/sframe/sframev3todo#Avoid_unaligned_accesses

Summary of Approaches and Analysis/Notes
Unaligned accesses may mean lower performance, but the alternative we 
pick must at least provide better performance.  It is also important 
that the chosen approach does not add bloat to the format.  Avoiding 
unaligned accesses at the expense of bloating up the format is not a 
good idea IMO.

Approach 1a: Bucketed members
  Pros: Negligible bloat.
  Cons: 1. Writing out the FRE data is somewhat more involved. Affects
   assemblers, linkers. 2. For the common case though, accessing stack 
offsets now needs more memory accesses per FRE.  This approach will not 
bring clear performance benefits; the additional complexity in SFrame 
readers and writers is not justified then either.

Approach 1b: Bucketed members with Index
  Cons: Significant bloat (~30%).

Approach 2: De-duplicated "stack offsets"
  Pros: Will help reduce the size of SFrame sections.
  Cons: 1. SFrame FRE layout is designed to be flexible so that it can
   serve needs of new ABIs:  The varlen data is interpreted as stack 
offsets on x86_64, and AArch64, but may not be the case for other ABIs. 
De-duplicating non-structured data is not meaningful. 2. Writing out the 
FRE data is considerably more involved, increasing complexity in the toolchain.

Approach 3: Good old basic padding
  Cons: Significant bloat (~22%).  Performance win arguable as well.

IMO, none of these approaches provides a viable way to move forward. The 
proposed methods either fail to deliver the desired clear performance 
gain or introduce a significant size penalty or complexity, which is an 
unacceptable trade-off.

Would like to gather inputs from the interested folks on this. Please 
take a look and chime in.  Other ideas welcome.

Thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-12 17:34 Unaligned access trade-offs for SFrame FRE layout Indu Bhagat
@ 2025-09-12 18:19 ` Segher Boessenkool
  2025-09-12 19:18 ` Steven Rostedt
  2025-09-14 14:14 ` Jan Beulich
  2 siblings, 0 replies; 26+ messages in thread
From: Segher Boessenkool @ 2025-09-12 18:19 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Steven Rostedt, Serhei Makarov, Binutils,
	linux-toolchains@vger.kernel.org

On Fri, Sep 12, 2025 at 10:34:42AM -0700, Indu Bhagat wrote:
Why on earth would you ever care about an "unaligned" access?  The ABI
requires you to support any of it.

You might want it to be faster, but that is very very marginal no matter
what.


Segher

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-12 17:34 Unaligned access trade-offs for SFrame FRE layout Indu Bhagat
  2025-09-12 18:19 ` Segher Boessenkool
@ 2025-09-12 19:18 ` Steven Rostedt
  2025-09-13  7:56   ` Indu Bhagat
       [not found]   ` <CAEG7qUxk_cZYv3X_VM6+ZGaVFAD-7jdPd3xA92xYHUAqyzb2Xw@mail.gmail.com>
  2025-09-14 14:14 ` Jan Beulich
  2 siblings, 2 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-09-12 19:18 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Serhei Makarov, Binutils,
	linux-toolchains@vger.kernel.org

On Fri, 12 Sep 2025 10:34:42 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> TL;DR: Thinking and experimenting a bit on the possible approaches for 
> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
> am not convinced that avoiding unaligned accesses for performance is 
> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
> not a good trade off for SFrame.
> 
> Problem Statement
> On architectures such as x86_64, AArch64, and s390x, unaligned memory 
> accesses are handled transparently by the hardware but incur a 
> performance penalty. The objective of this analysis is to evaluate if 
> these unaligned accesses can be eliminated from the SFrame FRE layout 
> and if doing so provides a net performance benefit.

I guess the question is really, is it that big of a performance hit?

I know some others were worried about the performance, but we should look
at measurements too. Is it going to be a big enough issue in the stack
unwinding code to even notice?

> 
> The central challenge is that any alternative must demonstrate a clear
> performance improvement while avoiding significant size overhead. 
> Introducing "bloat" to the format to solve a potential performance issue 
> is a poor trade-off.

Correct. I would like to see performance numbers before we invest too much
time in this.

> 
> Source of unaligned accesses in SFrame FRE
>   - (#1) Access to the SFrame FRE start address (sfre_start_address)
>   - (#2) Access to the SFrame FRE stack offsets,  This is varlen data 
> tailing SFrame FRE top-level members (sfre_start_address and FRE info), 
> usually interpreted as stack offsets)

BTW, we should also look at how often are there unaligned accesses? All the
time? or just a percentage of time? If it is a percentage, what is that
percentage?

> 
> (Note that in the SFrame specification, SFrame Header, and SFrame FDE 
> (function descriptor entry) have aligned accesses.)
> 
> Updated notes on the various approaches and respective evaluation notes 
> on the wiki page:
> https://sourceware.org/binutils/wiki/sframe/sframev3todo#Avoid_unaligned_accesses
> 
> Summary of Approaches and Analysis/Notes
> Unaligned accesses may mean lower performance, but the alternative we 
> pick must at least provide better performance.  It is also important 
> that the chosen approach does not add bloat to the format.  Avoiding 
> unaligned accesses at the expense of bloating up the format is not a 
> good idea IMO.
> 
> Approach 1a: Bucketed members
>   Pros: Negligible bloat.
>   Cons: 1. Writing out the FRE data is somewhat more involved. Affects
>    assemblers, linkers. 2. For the common case though, accessing stack 
> offsets now needs more memory accesses per FRE.  This approach will not 
> bring clear performance benefits; the additional complexity in SFrame 
> readers and writers is not justified then either.

Right. If this causes more cache misses or worse, more page faults, to save
from an unaligned access, I don't think it's worth it.

> 
> Approach 1b: Bucketed members with Index
>   Cons: Significant bloat (~30%).

I personally believe 30% is too much overhead.

> 
> Approach 2: De-duplicated "stack offsets"
>   Pros: Will help reduce the size of SFrame sections.
>   Cons: 1. SFrame FRE layout is designed to be flexible so that it can
>    serve needs of new ABIs:  The varlen data is interpreted as stack 
> offsets on x86_64, and AArch64, but may not be the case for other ABIs. 
> De-duplicating non-structured data is not meaningful. 2. Writing out the 
> FRE data is quite more involved, increasing the complexity in Toolchain.

I don't know enough to comment about the above.

> 
> Approach 3: Good old basic padding
>   Cons: Significant bloat (~22%).  Performance win arguable as well.

I think 22% is also too much.

> 
> IMO, none of these approaches provide viable way to move forward. The 
> proposed methods either fail to deliver the desired clear performance 
> gain or introduce a significant size penalty or complexity, which is an 
> unacceptable trade-off.
> 
> Would like to gather inputs from the interested folks on this. Please 
> take a look and chime in.  Other ideas welcome.

As stated above, I'd like to know how much of a performance benefit this
is. It may not be worth it.

I wasn't one of the people who brought up unaligned accesses. I'd like to
hear from them to get their input.

-- Steve

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-12 19:18 ` Steven Rostedt
@ 2025-09-13  7:56   ` Indu Bhagat
  2025-09-15 16:04     ` Steven Rostedt
       [not found]   ` <CAEG7qUxk_cZYv3X_VM6+ZGaVFAD-7jdPd3xA92xYHUAqyzb2Xw@mail.gmail.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Indu Bhagat @ 2025-09-13  7:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Serhei Makarov, Binutils,
	linux-toolchains@vger.kernel.org

On 9/12/25 12:18 PM, Steven Rostedt wrote:
> On Fri, 12 Sep 2025 10:34:42 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
> 
>> TL;DR: Thinking and experimenting a bit on the possible approaches for
>> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I
>> am not convinced that avoiding unaligned accesses for performance is
>> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is
>> not a good trade off for SFrame.
>>
>> Problem Statement
>> On architectures such as x86_64, AArch64, and s390x, unaligned memory
>> accesses are handled transparently by the hardware but incur a
>> performance penalty. The objective of this analysis is to evaluate if
>> these unaligned accesses can be eliminated from the SFrame FRE layout
>> and if doing so provides a net performance benefit.
> 
> I guess the question is really, is it that big of a performance hit?
> 
> I know some others were worried about the performance, but we should look
> at measurements too. Is it going to be a big enough issue in the stack
> unwinding code to even notice?
> 

I think quantifying the performance impact of unaligned accesses for 
stack tracing using SFrame sections will be a larger experiment, which 
will be hardware dependent.

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

It seems, for newer architectures, if the unaligned access is to the 
same cache line, the cycle impact is minimal.  When the unaligned access 
crosses cache line boundary, there may be a few cycles of impact. When 
unaligned accesses cross page boundary, it gets noticeable.
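
To make the three cases above concrete, here is a minimal sketch in C that
classifies a single access.  The 64-byte cache line and 4096-byte page are
assumed values for illustration only (both are hardware dependent), and the
names classify_access and crosses are hypothetical, not part of any SFrame
library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration; real values are hardware dependent.  */
#define CACHE_LINE_SIZE 64u
#define PAGE_SIZE_BYTES 4096u

enum access_kind
{
  ACCESS_ALIGNED,              /* Address is a multiple of the access size.  */
  ACCESS_UNALIGNED_SAME_LINE,  /* Unaligned, but within one cache line.  */
  ACCESS_CROSSES_CACHE_LINE,   /* Spans two cache lines in the same page.  */
  ACCESS_CROSSES_PAGE          /* Spans two pages.  */
};

/* Return nonzero if the SIZE bytes at ADDR span a BOUNDARY-sized block.  */
static int
crosses (uintptr_t addr, size_t size, uintptr_t boundary)
{
  return (addr / boundary) != ((addr + size - 1) / boundary);
}

/* Classify an access of SIZE bytes (1, 2, 4 or 8) at address ADDR.  */
enum access_kind
classify_access (uintptr_t addr, size_t size)
{
  assert (size == 1 || size == 2 || size == 4 || size == 8);

  if (addr % size == 0)
    return ACCESS_ALIGNED;
  if (crosses (addr, size, PAGE_SIZE_BYTES))
    return ACCESS_CROSSES_PAGE;
  if (crosses (addr, size, CACHE_LINE_SIZE))
    return ACCESS_CROSSES_CACHE_LINE;
  return ACCESS_UNALIGNED_SAME_LINE;
}
```

Running something like this over every FRE start address and stack offset in
a section would split the unaligned accesses into the cheap same-line case
and the costlier line-crossing and page-crossing cases.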

That said, I can give some static numbers from some SFrame sections on 
x86_64 for now.  I see that across the SFrame sections for GNU Binutils 
binaries (-O2 binaries):
   - ~30% of SFrame FRE start addr across all functions are unaligned[*]
   - ~4% of all stack offsets are unaligned.

[*] Caveat: This should not be construed to mean that 1 out of every 3 
SFrame FRE start addr are unaligned.  There may be functions where 
SFrame FRE start addr are all aligned (e.g., because they were 1-byte 
long). Above data is average across all functions in one binary.

>>
>> The central challenge is that any alternative must demonstrate a clear
>> performance improvement while avoiding significant size overhead.
>> Introducing "bloat" to the format to solve a potential performance issue
>> is a poor trade-off.
> 
> Correct. I would like to see performance numbers before we invest too much
> time in this.
> 
>>
>> Source of unaligned accesses in SFrame FRE
>>    - (#1) Access to the SFrame FRE start address (sfre_start_address)
>>    - (#2) Access to the SFrame FRE stack offsets,  This is varlen data
>> tailing SFrame FRE top-level members (sfre_start_address and FRE info),
>> usually interpreted as stack offsets)
> 
> BTW, we should also look at how often are there unaligned accesses? All the
> time? or just a percentage of time? If it is a percentage, what is that
> percentage?
> 

The stack offsets for an FRE are accessed once per frame (and an SFrame 
FRE may have an average of 2 stack offsets on x86_64).

WRT FRE start addr, multiple SFrame FRE start addresses may need to be 
read until the applicable SFrame FRE is found.  SFrame FRE lookup is 
serial. SFrame FRE start addr can be 1-byte, 2-byte or 4-byte (one size 
chosen per function).
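
A rough sketch of such a serial lookup, in C, to show where the start
address reads happen.  This is an illustration under assumptions, not the
libsframe implementation: the helper names (read_uint, fre_size, find_fre)
are hypothetical, and the info-byte layout used here (offset count in bits
1-4, offset-size code in bits 5-6) is assumed for the sketch and should be
checked against the SFrame specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 1-, 2- or 4-byte unsigned value in host byte order.  memcpy is
   used so that no unaligned load is issued directly.  */
static uint32_t
read_uint (const unsigned char *p, size_t size)
{
  uint8_t v8;
  uint16_t v16;
  uint32_t v32;

  switch (size)
    {
    case 1:
      memcpy (&v8, p, sizeof v8);
      return v8;
    case 2:
      memcpy (&v16, p, sizeof v16);
      return v16;
    default:
      assert (size == 4);
      memcpy (&v32, p, sizeof v32);
      return v32;
    }
}

/* Size in bytes of one FRE: start address, one info byte, then the varlen
   offsets.  Assumed info-byte layout (illustrative): bits 1-4 hold the
   offset count, bits 5-6 hold log2 of the offset size in bytes.  */
static size_t
fre_size (const unsigned char *fre, size_t addr_size)
{
  unsigned char info = fre[addr_size];
  size_t count = (info >> 1) & 0xf;
  size_t offset_size = (size_t) 1 << ((info >> 5) & 0x3);

  return addr_size + 1 + count * offset_size;
}

/* Walk NUM_FRES entries serially and return the last FRE whose start
   address is <= PC_OFFSET, or NULL if there is none.  */
const unsigned char *
find_fre (const unsigned char *fres, size_t num_fres, size_t addr_size,
          uint32_t pc_offset)
{
  const unsigned char *match = NULL;

  for (size_t i = 0; i < num_fres; i++)
    {
      if (read_uint (fres, addr_size) > pc_offset)
        break;
      match = fres;
      fres += fre_size (fres, addr_size);
    }
  return match;
}
```

Since each FRE has a variable size, the position of the next start address
depends on the previous entry, which is why a start address of 2 or 4 bytes
can land on any byte boundary.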

The larger point I was trying to make was:  The alternative layouts of 
SFrame FREs may fare worse in performance or compactness or both.  So 
either way, avoiding unaligned accesses does not look feasible with any 
of those approaches.

>>
>> (Note that in the SFrame specification, SFrame Header, and SFrame FDE
>> (function descriptor entry) have aligned accesses.)
>>
>> Updated notes on the various approaches and respective evaluation notes
>> on the wiki page:
>> https://sourceware.org/binutils/wiki/sframe/sframev3todo#Avoid_unaligned_accesses
>>
>> Summary of Approaches and Analysis/Notes
>> Unaligned accesses may mean lower performance, but the alternative we
>> pick must at least provide better performance.  It is also important
>> that the chosen approach does not add bloat to the format.  Avoiding
>> unaligned accesses at the expense of bloating up the format is not a
>> good idea IMO.
>>
>> Approach 1a: Bucketed members
>>    Pros: Negligible bloat.
>>    Cons: 1. Writing out the FRE data is somewhat more involved. Affects
>>     assemblers, linkers. 2. For the common case though, accessing stack
>> offsets now needs more memory accesses per FRE.  This approach will not
>> bring clear performance benefits; the additional complexity in SFrame
>> readers and writers is not justified then either.
> 
> Right. If this causes more cache misses or worse, more page faults, to save
> from an unaligned access, I don't think it's worth it.
> 
>>
>> Approach 1b: Bucketed members with Index
>>    Cons: Significant bloat (~30%).
> 
> I personally believe 30% is too much overhead.
> 
>>
>> Approach 2: De-duplicated "stack offsets"
>>    Pros: Will help reduce the size of SFrame sections.
>>    Cons: 1. SFrame FRE layout is designed to be flexible so that it can
>>     serve needs of new ABIs:  The varlen data is interpreted as stack
>> offsets on x86_64, and AArch64, but may not be the case for other ABIs.
>> De-duplicating non-structured data is not meaningful. 2. Writing out the
>> FRE data is quite more involved, increasing the complexity in Toolchain.
> 
> I don't know enough to comment about the above.
> 
>>
>> Approach 3: Good old basic padding
>>    Cons: Significant bloat (~22%).  Performance win arguable as well.
> 
> I think 22% is also too much.
> 
>>
>> IMO, none of these approaches provide viable way to move forward. The
>> proposed methods either fail to deliver the desired clear performance
>> gain or introduce a significant size penalty or complexity, which is an
>> unacceptable trade-off.
>>
>> Would like to gather inputs from the interested folks on this. Please
>> take a look and chime in.  Other ideas welcome.
> 
> As stated above, I'd like to know how much of a performance benefit this
> is. It may not be worth it.
> 
> I wasn't one of the people who brought up unaligned accesses. I'd like to
> hear from them to get their input.
> 
> -- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
       [not found]   ` <CAEG7qUxk_cZYv3X_VM6+ZGaVFAD-7jdPd3xA92xYHUAqyzb2Xw@mail.gmail.com>
@ 2025-09-13  8:01     ` Indu Bhagat
  0 siblings, 0 replies; 26+ messages in thread
From: Indu Bhagat @ 2025-09-13  8:01 UTC (permalink / raw)
  To: Sterling Augustine, Steven Rostedt
  Cc: Jens Remus, Pavel Labath, Andrii Nakryiko, Josh Poimboeuf,
	Serhei Makarov, Binutils, linux-toolchains@vger.kernel.org

On 9/12/25 1:17 PM, Sterling Augustine wrote:
> Agreed with the others that optimizing for size is preferred, but that 
> it would be very nice to know just how much performance we are talking 
> about.
> 
> I wonder if a partial solution would be enough to alleviate the concerns.
> 
> Some of the unaligned accesses can be fixed without a change to the 
> specification at all. An implementation could align just the first 
> sframe_row_entry for every function. That first entry is accessed every 
> time a function is unwound, so this idea reduces the number of unaligned 
> accesses by about one per frame.
> 

Yes, I too thought about that.  And correct, that should be easily doable.

> Not sure what the size penalty would be, but probably quite a bit less 
> than 22%.

Much less than 22%. I would expect it to be quite low. A few bytes per 
FDE essentially.
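
As a back-of-the-envelope illustration of "a few bytes per FDE", here is a
small C helper that computes the padding needed to bring the first FRE of a
function up to a given alignment.  The name padding_for is hypothetical, and
the choice of alignment (for example, the function's FRE start address size)
is an assumption of this sketch rather than anything in the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed so that OFFSET becomes a multiple of ALIGN.
   ALIGN must be a power of two.  */
size_t
padding_for (size_t offset, size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (align - (offset & (align - 1))) & (align - 1);
}
```

The result is at most align - 1, so with a 4-byte alignment the cost is
between 0 and 3 bytes per function.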

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-12 17:34 Unaligned access trade-offs for SFrame FRE layout Indu Bhagat
  2025-09-12 18:19 ` Segher Boessenkool
  2025-09-12 19:18 ` Steven Rostedt
@ 2025-09-14 14:14 ` Jan Beulich
  2025-09-14 14:39   ` Rainer Orth
  2 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-09-14 14:14 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Steven Rostedt,
	Serhei Makarov, Binutils

On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
> TL;DR: Thinking and experimenting a bit on the possible approaches for 
> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
> am not convinced that avoiding unaligned accesses for performance is 
> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
> not a good trade off for SFrame.
> 
> Problem Statement
> On architectures such as x86_64, AArch64, and s390x, unaligned memory 
> accesses are handled transparently by the hardware but incur a 
> performance penalty.

As you say in a reply, may incur. However, shouldn't we also consider
possible ports of SFrame to architectures which don't handle this as
transparently? Off the top of my head I don't, for example, recall
whether RISC-V requires unaligned accesses to be handled transparently
by the hardware.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-14 14:14 ` Jan Beulich
@ 2025-09-14 14:39   ` Rainer Orth
  2025-09-14 15:23     ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Rainer Orth @ 2025-09-14 14:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Indu Bhagat, linux-toolchains@vger.kernel.org, Jens Remus,
	Sterling Augustine, Pavel Labath, Andrii Nakryiko, Josh Poimboeuf,
	Steven Rostedt, Serhei Makarov, Binutils

Hi Jan,

> On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
>> TL;DR: Thinking and experimenting a bit on the possible approaches for 
>> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
>> am not convinced that avoiding unaligned accesses for performance is 
>> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
>> not a good trade off for SFrame.
>> 
>> Problem Statement
>> On architectures such as x86_64, AArch64, and s390x, unaligned memory 
>> accesses are handled transparently by the hardware but incur a 
>> performance penalty.
>
> As you say in a reply, may incur. However, shouldn't we also consider
> possible ports of SFrame to architectures which don't handle this as
> transparently? Off the top of my head I don't, for example, recall
> whether RISC-V requires unaligned accesses to be handled transparently
> by the hardware.

look for STRICT_ALIGNMENT in the GCC sources in gcc/config.  While
several are embedded targets, there's also sparc in that list.

Getting SIGBUS on SPARC is a good reality check for code that doesn't
take that into account.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-14 14:39   ` Rainer Orth
@ 2025-09-14 15:23     ` Jan Beulich
  2025-09-14 16:18       ` Rainer Orth
  2025-09-15  9:08       ` Segher Boessenkool
  0 siblings, 2 replies; 26+ messages in thread
From: Jan Beulich @ 2025-09-14 15:23 UTC (permalink / raw)
  To: Rainer Orth
  Cc: Indu Bhagat, linux-toolchains@vger.kernel.org, Jens Remus,
	Sterling Augustine, Pavel Labath, Andrii Nakryiko, Josh Poimboeuf,
	Steven Rostedt, Serhei Makarov, Binutils

On 14.09.2025 16:39, Rainer Orth wrote:
>> On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
>>> TL;DR: Thinking and experimenting a bit on the possible approaches for 
>>> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
>>> am not convinced that avoiding unaligned accesses for performance is 
>>> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
>>> not a good trade off for SFrame.
>>>
>>> Problem Statement
>>> On architectures such as x86_64, AArch64, and s390x, unaligned memory 
>>> accesses are handled transparently by the hardware but incur a 
>>> performance penalty.
>>
>> As you say in a reply, may incur. However, shouldn't we also consider
>> possible ports of SFrame to architectures which don't handle this as
>> transparently? Off the top of my head I don't, for example, recall
>> whether RISC-V requires unaligned accesses to be handled transparently
>> by the hardware.
> 
> look for STRICT_ALIGNMENT in the GCC sources in gcc/config.  While
> several are embedded targets, there's also sparc in that list.

But is this setting a good reference for the purpose here?  For RISC-V it's
command line (?) controlled (TARGET_STRICT_ALIGN), despite the spec saying

"An EEI may not guarantee misaligned loads and stores are handled invisibly.
 In this case, loads and stores that are not naturally aligned may either
 complete execution successfully or raise an exception. The exception raised
 can be either an address-misaligned exception or an access-fault exception."

It's okay for gcc to make assumptions (assuming they're properly documented),
but I don't think such assumptions can be extended to a discussion like the
one here.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-14 15:23     ` Jan Beulich
@ 2025-09-14 16:18       ` Rainer Orth
  2025-09-14 18:10         ` Jan Beulich
  2025-09-15  9:08       ` Segher Boessenkool
  1 sibling, 1 reply; 26+ messages in thread
From: Rainer Orth @ 2025-09-14 16:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Indu Bhagat, linux-toolchains@vger.kernel.org, Jens Remus,
	Sterling Augustine, Pavel Labath, Andrii Nakryiko, Josh Poimboeuf,
	Steven Rostedt, Serhei Makarov, Binutils

Jan Beulich <jbeulich@suse.com> writes:

> On 14.09.2025 16:39, Rainer Orth wrote:
>>> On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
>>>> TL;DR: Thinking and experimenting a bit on the possible approaches for 
>>>> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
>>>> am not convinced that avoiding unaligned accesses for performance is 
>>>> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
>>>> not a good trade off for SFrame.
>>>>
>>>> Problem Statement
>>>> On architectures such as x86_64, AArch64, and s390x, unaligned memory 
>>>> accesses are handled transparently by the hardware but incur a 
>>>> performance penalty.
>>>
>>> As you say in a reply, may incur. However, shouldn't we also consider
>>> possible ports of SFrame to architectures which don't handle this as
>>> transparently? Off the top of my head I don't, for example, recall
>>> whether RISC-V requires unaligned accesses to be handled transparently
>>> by the hardware.
>> 
>> look for STRICT_ALIGNMENT in the GCC sources in gcc/config.  While
>> several are embedded targets, there's also sparc in that list.
>
> But is this setting a good reference for the purpose here. For RISC-V it's
> command line (?) controlled (TARGET_STRICT_ALIGN), despite the spec saying
>
> "An EEI may not guarantee misaligned loads and stores are handled invisibly.
>  In this case, loads and stores that are not naturally aligned may either
>  complete execution successfully or raise an exception. The exception raised
>  can be either an address-misaligned exception or an access-fault exception."
>
> It's okay for gcc to make assumptions (assuming they're properly documented),
> but I don't think such assumptions can be extended to a discussion like the
> one here.

please look at the actual code:

gcc/config/riscv/riscv.cc:           TARGET_STRICT_ALIGN ? 0 : 1);

on RISC-V, the setting is controlled by -mstrict-align.  On most others,
it's just 1.

I'm just pointing out an easy way to answer the question: there's a
considerable number of strict-alignment targets.  Whether they are relevant to
the discussion at hand is for the SFrame developers to decide: if they
come to the conclusion that none of the affected CPUs is of interest,
that's certainly fine.  However, from my recent experience porting
LLVM's openmp to SPARC, fixing this as an afterthought takes some time
and analysis, so you need to decide if you want to support such targets
or not.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-14 16:18       ` Rainer Orth
@ 2025-09-14 18:10         ` Jan Beulich
  2025-09-15  5:42           ` Indu Bhagat
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2025-09-14 18:10 UTC (permalink / raw)
  To: Rainer Orth
  Cc: Indu Bhagat, linux-toolchains@vger.kernel.org, Jens Remus,
	Sterling Augustine, Pavel Labath, Andrii Nakryiko, Josh Poimboeuf,
	Steven Rostedt, Serhei Makarov, Binutils

On 14.09.2025 18:18, Rainer Orth wrote:
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 14.09.2025 16:39, Rainer Orth wrote:
>>>> On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
>>>>> TL;DR: Thinking and experimenting a bit on the possible approaches for 
>>>>> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I 
>>>>> am not convinced that avoiding unaligned accesses for performance is 
>>>>> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is 
>>>>> not a good trade off for SFrame.
>>>>>
>>>>> Problem Statement
>>>>> On architectures such as x86_64, AArch64, and s390x, unaligned memory 
>>>>> accesses are handled transparently by the hardware but incur a 
>>>>> performance penalty.
>>>>
>>>> As you say in a reply, may incur. However, shouldn't we also consider
>>>> possible ports of SFrame to architectures which don't handle this as
>>>> transparently? Off the top of my head I don't, for example, recall
>>>> whether RISC-V requires unaligned accesses to be handled transparently
>>>> by the hardware.
>>>
>>> look for STRICT_ALIGNMENT in the GCC sources in gcc/config.  While
>>> several are embedded targets, there's also sparc in that list.
>>
>> But is this setting a good reference for the purpose here. For RISC-V it's
>> command line (?) controlled (TARGET_STRICT_ALIGN), despite the spec saying
>>
>> "An EEI may not guarantee misaligned loads and stores are handled invisibly.
>>  In this case, loads and stores that are not naturally aligned may either
>>  complete execution successfully or raise an exception. The exception raised
>>  can be either an address-misaligned exception or an access-fault exception."
>>
>> It's okay for gcc to make assumptions (assuming they're properly documented),
>> but I don't think such assumptions can be extended to a discussion like the
>> one here.
> 
> please look at the actual code:
> 
> gcc/config/riscv/riscv.cc:           TARGET_STRICT_ALIGN ? 0 : 1);
> 
> on RISC-V, the setting is controlled by -mstrict-align.  On most others,
> it's just 1.

Precisely my point: By (not or wrongly) using the command line option, you
can break things. Whereas such breakage wants avoiding here.

> I'm just pointing out an easy way to answer the question: there's a
> considerable number of strict-alignment targets.  If they is relevant to
> the discussion at hand is for the SFrame developers to decide: if they
> come to the conclusion that none of the affected CPUs is of interest,
> that's certainly fine.  However, from my recently experience porting
> LLVMs openmp to SPARC, fixing this as an afterthought takes some time
> and analysis, so you need to decide if you want to support such targets
> or not.

Yes, that's exactly why I brought up the point.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-14 18:10         ` Jan Beulich
@ 2025-09-15  5:42           ` Indu Bhagat
  2025-09-15 16:07             ` Steven Rostedt
  0 siblings, 1 reply; 26+ messages in thread
From: Indu Bhagat @ 2025-09-15  5:42 UTC (permalink / raw)
  To: Jan Beulich, Rainer Orth
  Cc: linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Steven Rostedt,
	Serhei Makarov, Binutils

On 9/14/25 11:10 AM, Jan Beulich wrote:
> On 14.09.2025 18:18, Rainer Orth wrote:
>> Jan Beulich <jbeulich@suse.com> writes:
>>
>>> On 14.09.2025 16:39, Rainer Orth wrote:
>>>>> On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
>>>>>> TL;DR: Thinking and experimenting a bit on the possible approaches for
>>>>>> avoiding unaligned accesses in the SFrame FRE layout (in SFrame V3), I
>>>>>> am not convinced that avoiding unaligned accesses for performance is
>>>>>> worth it.  IMO, forsaking compactness for avoiding unaligned accesses is
>>>>>> not a good trade off for SFrame.
>>>>>>
>>>>>> Problem Statement
>>>>>> On architectures such as x86_64, AArch64, and s390x, unaligned memory
>>>>>> accesses are handled transparently by the hardware but incur a
>>>>>> performance penalty.
>>>>>
>>>>> As you say in a reply, may incur. However, shouldn't we also consider
>>>>> possible ports of SFrame to architectures which don't handle this as
>>>>> transparently? Off the top of my head I don't, for example, recall
>>>>> whether RISC-V requires unaligned accesses to be handled transparently
>>>>> by the hardware.
>>>>
>>>> look for STRICT_ALIGNMENT in the GCC sources in gcc/config.  While
>>>> several are embedded targets, there's also sparc in that list.
>>>
>>> But is this setting a good reference for the purpose here. For RISC-V it's
>>> command line (?) controlled (TARGET_STRICT_ALIGN), despite the spec saying
>>>
>>> "An EEI may not guarantee misaligned loads and stores are handled invisibly.
>>>   In this case, loads and stores that are not naturally aligned may either
>>>   complete execution successfully or raise an exception. The exception raised
>>>   can be either an address-misaligned exception or an access-fault exception."
>>>
>>> It's okay for gcc to make assumptions (assuming they're properly documented),
>>> but I don't think such assumptions can be extended to a discussion like the
>>> one here.
>>
>> please look at the actual code:
>>
>> gcc/config/riscv/riscv.cc:           TARGET_STRICT_ALIGN ? 0 : 1);
>>
>> on RISC-V, the setting is controlled by -mstrict-align.  On most others,
>> it's just 1.
> 
> Precisely my point: By (not or wrongly) using the command line option, you
> can break things. Whereas such breakage wants avoiding here.
> 
>> I'm just pointing out an easy way to answer the question: there's a
>> considerable number of strict-alignment targets.  If they is relevant to
>> the discussion at hand is for the SFrame developers to decide: if they
>> come to the conclusion that none of the affected CPUs is of interest,
>> that's certainly fine.  However, from my recently experience porting
>> LLVMs openmp to SPARC, fixing this as an afterthought takes some time
>> and analysis, so you need to decide if you want to support such targets
>> or not.
> 
> Yes, that's exactly why I brought up the point.
> 

In such cases, the routines reading the SFrame data under consideration 
here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory 
will need to use a memcpy to copy out the data to an aligned location.

In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame 
FRE decoding" routine could be provided in an arch-specific manner in 
SFrame stack tracers.
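
As a rough sketch of what such a decoding helper might look like (the
function names here are made up for illustration and are not the
libsframe API; the value is read in native byte order, so any endian
conversion would be a separate step):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of libsframe: read a 32-bit value in
   native byte order from an address that may be unaligned.  */
static inline uint32_t
read_unaligned_u32 (const void *p)
{
  uint32_t v;
  /* memcpy makes no alignment assumption about P, so this is safe on
     strict-alignment targets as well.  */
  memcpy (&v, p, sizeof v);
  return v;
}

/* Same idea for a 16-bit field.  */
static inline uint16_t
read_unaligned_u16 (const void *p)
{
  uint16_t v;
  memcpy (&v, p, sizeof v);
  return v;
}
```

On targets where unaligned loads are cheap, compilers commonly turn
such a fixed-size memcpy into a single load, so the helper should not
cost much there.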


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-14 15:23     ` Jan Beulich
  2025-09-14 16:18       ` Rainer Orth
@ 2025-09-15  9:08       ` Segher Boessenkool
  1 sibling, 0 replies; 26+ messages in thread
From: Segher Boessenkool @ 2025-09-15  9:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Rainer Orth, Indu Bhagat, linux-toolchains@vger.kernel.org,
	Jens Remus

On 14.09.2025 16:39, Rainer Orth wrote:
>> On 12.09.2025 19:34, Indu Bhagat via Binutils wrote:
> look for STRICT_ALIGNMENT in the GCC sources in gcc/config.  While
> several are embedded targets, there's also sparc in that list.
>
> But is this setting a good reference for the purpose here.

It is not a setting usually, also not for you, it is a command-line
option.  Often it is good (for performance) to only do naturally aligned
accesses.  For some (sub-)archs it simply is impossible to even try to
do misaligned accesses OTOH.

> It's okay for gcc to make assumptions (assuming they're properly
> documented),

Only some implementation decisions have to be documented.  And
assumptions are almost never okay.  Typically we depend on the user
promising they do not do X (via a -mno-X flag, say) before we assume
they do not do X.

Segher


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-13  7:56   ` Indu Bhagat
@ 2025-09-15 16:04     ` Steven Rostedt
  0 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-09-15 16:04 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Serhei Makarov, Binutils,
	linux-toolchains@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2519 bytes --]

On Sat, 13 Sep 2025 00:56:34 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> I think quantifying the performance impact of unaligned accesses for 
> stack tracing using SFrame sections will be larger experiment which will 
> be hardware dependent..
> 
> https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

I'm not so sure his example is good enough to show the overhead. So I wrote
a much simpler test. I create an array of 1,000,004 unsigned longs and fill
it with a simple increment.

I then loop over the array, reading at every 5618 increments, modulo
1,000,000. I use 5618 because it is a factor of 1,000,004 (5618 * 178).
By incrementing by this number and taking the modulus with 1,000,000, the
wrap will go to 1,000,000 + 4, meaning that looping 1,000,000 times over
1,000,000 words should hit all of them if they are 4 byte long words or
half of them if they are 8 byte words.

I time a loop of going over all 1,000,000 numbers and simply adding them. I
then report the total time.

The test takes an offset to add before reading. Here's my results on two
machines:

On my workstation: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

  $ ./aligned-access 0
  val = 2296870857426500744
  time = 6822 us
  $ ./aligned-access 1
  val = -2296870857426996608
  time = 7319 us
  $ ./aligned-access 2
  val = 2296870857426500744
  time = 7432 us
  $ ./aligned-access 3
  val = -2296870857426996608
  time = 7522 us
  $ ./aligned-access 4
  val = 2296870857426500744
  time = 6841 us

On my server: Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz

  $ ./aligned-access 0
  val = 2296870857426500744
  time = 7093 us
  $ ./aligned-access 1
  val = -2296870857426996608
  time = 6948 us
  $ ./aligned-access 2
  val = 2296870857426500744
  time = 6761 us
  $ ./aligned-access 3
  val = -2296870857426996608
  time = 7111 us
  $ ./aligned-access 4
  val = 2296870857426500744
  time = 6940 us
  $ ./aligned-access 5
  val = -2296870857426996608
  time = 6939 us
  $ ./aligned-access 6
  val = 2296870857426500744
  time = 6937 us
  $ ./aligned-access 7
  val = -2296870857426996608
  time = 7254 us
  $ ./aligned-access 8
  val = 2296870857426500744
  time = 6939 us

My workstation is a bit older than my server, and it looks like alignment
does make a difference. For my server, it didn't show any difference.

Thus, it looks like it's only a problem for older machines (on x86). Would
be good to see how the performance of this is on arm64 machines.

But feel free to try it out yourself.

-- Steve

[-- Attachment #2: aligned-access.c --]
[-- Type: text/x-c++src, Size: 1948 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include <errno.h>
#include <time.h>

static char *argv0;

static char *get_this_name(void)
{
	static char *this_name;
	char *arg;
	char *p;

	if (this_name)
		return this_name;

	arg = argv0;
	p = arg+strlen(arg);

	while (p >= arg && *p != '/')
		p--;
	p++;

	this_name = p;
	return p;
}

static void usage(void)
{
	char *p = get_this_name();

	printf("usage: %s [alignment]\n"
	       "  alignment is a number offset\n"
	       "\n",p);
	exit(-1);
}

static void __vdie(const char *fmt, va_list ap, int err)
{
	int ret = errno;
	char *p = get_this_name();

	if (err && errno)
		perror(p);
	else
		ret = -1;

	fprintf(stderr, "  ");
	vfprintf(stderr, fmt, ap);

	fprintf(stderr, "\n");
	exit(ret);
}

void die(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	__vdie(fmt, ap, 0);
	va_end(ap);
}

void pdie(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	__vdie(fmt, ap, 1);
	va_end(ap);
}

static unsigned long long get_time(void)
{
	unsigned long long time;
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
	time = ts.tv_sec * 1000000;
	time += ts.tv_nsec / 1000;

	return time;
}

#define ARRAY_SIZE 1000000

static void fill_array(void *array)
{
	for (int i = 0; i < ARRAY_SIZE; i++) {
		*(long *)(array + sizeof(long) * i) = i;
	}
}

int main (int argc, char **argv)
{
	unsigned long long start, end;
	unsigned long long val = 0;
	unsigned long long idx, i;
	void *array;
	int offset;

	argv0 = argv[0];

	if (argc < 2)
		usage();

	array = malloc((ARRAY_SIZE + 1) * sizeof(long));
	if (!array)
		pdie("Allocating array");

	fill_array(array);

	offset = atoi(argv[1]);

	start = get_time();
	for (i = 0; i < ARRAY_SIZE; i++) {
		idx = (i * 5618) + offset;
		idx %= 1000000;
		val += *(unsigned long *)(array + idx);
	}
	end = get_time();

	printf("val = %lld\n", val);
	printf("time = %lld us\n", end - start);
	
	return 0;
}


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-15  5:42           ` Indu Bhagat
@ 2025-09-15 16:07             ` Steven Rostedt
  2025-09-15 17:22               ` Segher Boessenkool
  2025-09-16  6:05               ` Fangrui Song
  0 siblings, 2 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-09-15 16:07 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Jan Beulich, Rainer Orth, linux-toolchains@vger.kernel.org,
	Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Serhei Makarov, Binutils

On Sun, 14 Sep 2025 22:42:46 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> In such cases, the routines reading the SFrame data under consideration 
> here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory 
> will need to use a memcpy to copy out the data to an aligned location.
> 
> In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame 
> FRE decoding" routine could be provided in a arch-specific manner in 
> SFrame stack tracers.

I'm perfectly fine with making it a requirement for the reader of the
SFrame section having to use memcpy into an aligned structure for reading
if the architecture requires it. Let only the architectures that have
issues with unaligned access take the performance hit.

-- Steve


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-15 16:07             ` Steven Rostedt
@ 2025-09-15 17:22               ` Segher Boessenkool
  2025-09-16  6:05               ` Fangrui Song
  1 sibling, 0 replies; 26+ messages in thread
From: Segher Boessenkool @ 2025-09-15 17:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Indu Bhagat, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Mon, Sep 15, 2025 at 12:07:42PM -0400, Steven Rostedt wrote:
> On Sun, 14 Sep 2025 22:42:46 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
> 
> > In such cases, the routines reading the SFrame data under consideration 
> > here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory 
> > will need to use a memcpy to copy out the data to an aligned location.
> > 
> > In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame 
> > FRE decoding" routine could be provided in a arch-specific manner in 
> > SFrame stack tracers.
> 
> I'm perfectly fine with making it a requirement for the reader of the
> SFrame section having to use memcpy into an aligned structure for reading
> if the architecture requires it. Let only the architectures that have
> issues with unaligned access take the performance hit.

Constructing the bigger value from a whole bunch of byte reads should be
pretty optimal, too.  Just don't force misaligned bigger reads, not even
on platforms where that *does* work (not all!), it might well be really,
really slow.


Segher


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-15 16:07             ` Steven Rostedt
  2025-09-15 17:22               ` Segher Boessenkool
@ 2025-09-16  6:05               ` Fangrui Song
  2025-09-16 15:58                 ` Steven Rostedt
                                   ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Fangrui Song @ 2025-09-16  6:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Indu Bhagat, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Mon, Sep 15, 2025 at 9:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 14 Sep 2025 22:42:46 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> > In such cases, the routines reading the SFrame data under consideration
> > here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory
> > will need to use a memcpy to copy out the data to an aligned location.
> >
> > In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame
> > FRE decoding" routine could be provided in a arch-specific manner in
> > SFrame stack tracers.
>
> I'm perfectly fine with making it a requirement for the reader of the
> SFrame section having to use memcpy into an aligned structure for reading
> if the architecture requires it. Let only the architectures that have
> issues with unaligned access take the performance hit.
>
> -- Steve

I agree. Unaligned access has nearly zero performance impact on modern
architectures, provided the access doesn't span additional cache
lines.
The padding required for alignment would increase the size, likely
creating more overhead than any alignment benefit would justify.
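
To make the cache-line condition concrete, here is a small sketch of a
check for whether an access touches more than one line. The 64-byte
line size and the function name are assumptions for illustration only;
the real line size depends on the CPU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes is common but not universal.  */
#define CACHE_LINE_SIZE 64u

/* Return true if an access of SIZE bytes (SIZE > 0) starting at ADDR
   touches more than one cache line.  */
static inline bool
spans_cache_lines (uintptr_t addr, size_t size)
{
  uintptr_t first_line = addr / CACHE_LINE_SIZE;
  uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
  return first_line != last_line;
}
```

For example, a 4-byte read at byte offset 62 covers bytes 62..65 and
so spans two lines, while the same read at offset 60 stays within one.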

(
From a linker and binary utilities perspective, I'd even suggest
adopting a universal little-endian format regardless of the target
system's native endianness.
This would eliminate the need for endianness templates in the C++ code
and simplify toolchain implementation across platforms.

On the big-endian z/Architecture, this is efficient: the LOAD REVERSED
instructions are used by the bswap versions in the following program,
not even requiring extra instructions.
#define WIDTH(x) \
typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
uint##x load_inc##x(uint##x *p) { return *p+1; } \
uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
uint##x load_eq##x(uint##x *p) { return *p==3; } \
uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \

WIDTH(16);
WIDTH(32);
WIDTH(64);
)


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16  6:05               ` Fangrui Song
@ 2025-09-16 15:58                 ` Steven Rostedt
  2025-09-18 10:39                   ` Jens Remus
  2025-09-16 16:03                 ` Indu Bhagat
  2025-09-17 21:12                 ` Steven Rostedt
  2 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2025-09-16 15:58 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Indu Bhagat, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Mon, 15 Sep 2025 23:05:09 -0700
Fangrui Song <maskray@sourceware.org> wrote:

> From a linker and binary utilities perspective, I'd even suggest  
> adopting a universal little-endian format regardless of the target
> system's native endianness.
> This would eliminate the need for endianness templates in the C++ code
> and simplify toolchain implementation across platforms.
> 
> On the big-endian z/Architecture, this is efficient: the LOAD REVERSED
> instructions are used by the bswap versions in the following program,
> not even requiring extra instructions.
> #define WIDTH(x) \
> typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
> uint##x load_inc##x(uint##x *p) { return *p+1; } \
> uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
> uint##x load_eq##x(uint##x *p) { return *p==3; } \
> uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \
> 
> WIDTH(16);
> WIDTH(32);
> WIDTH(64);

I would like to hear the comments from Jens on this, as he's adapting
SFrames for the s390 which I believe is big-endian.

-- Steve



* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16  6:05               ` Fangrui Song
  2025-09-16 15:58                 ` Steven Rostedt
@ 2025-09-16 16:03                 ` Indu Bhagat
  2025-09-16 16:32                   ` Fangrui Song
  2025-09-17 21:12                 ` Steven Rostedt
  2 siblings, 1 reply; 26+ messages in thread
From: Indu Bhagat @ 2025-09-16 16:03 UTC (permalink / raw)
  To: Fangrui Song, Steven Rostedt
  Cc: Jan Beulich, Rainer Orth, linux-toolchains@vger.kernel.org,
	Jens Remus, Sterling Augustine, Pavel Labath, Andrii Nakryiko,
	Josh Poimboeuf, Serhei Makarov, Binutils

On 9/15/25 11:05 PM, Fangrui Song wrote:
> On Mon, Sep 15, 2025 at 9:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Sun, 14 Sep 2025 22:42:46 -0700
>> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>>
>>> In such cases, the routines reading the SFrame data under consideration
>>> here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory
>>> will need to use a memcpy to copy out the data to an aligned location.
>>>
>>> In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame
>>> FRE decoding" routine could be provided in a arch-specific manner in
>>> SFrame stack tracers.
>>
>> I'm perfectly fine with making it a requirement for the reader of the
>> SFrame section having to use memcpy into an aligned structure for reading
>> if the architecture requires it. Let only the architectures that have
>> issues with unaligned access take the performance hit.
>>
>> -- Steve
> 
> I agree. Unaligned access has nearly zero performance impact on modern
> architectures, provided the access doesn't span additional cache
> lines.
> The padding required for alignment would increase the size, likely
> creating more overhead than any alignment benefit would justify.
> 
> (
>  From a linker and binary utilities perspective, I'd even suggest
> adopting a universal little-endian format regardless of the target
> system's native endianness.
> This would eliminate the need for endianness templates in the C++ code
> and simplify toolchain implementation across platforms.
> 

(Perhaps I am missing something) Wouldn't a toolchain implementation need 
endianness handling anyway to support cross toolchains?

> On the big-endian z/Architecture, this is efficient: the LOAD REVERSED
> instructions are used by the bswap versions in the following program,
> not even requiring extra instructions.
> #define WIDTH(x) \
> typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
> uint##x load_inc##x(uint##x *p) { return *p+1; } \
> uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
> uint##x load_eq##x(uint##x *p) { return *p==3; } \
> uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \
> 
> WIDTH(16);
> WIDTH(32);
> WIDTH(64);
> )

For AArch64 which SFrame supports too, this is not true. AArch64 has 
both LE and BE.


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16 16:03                 ` Indu Bhagat
@ 2025-09-16 16:32                   ` Fangrui Song
  2025-09-16 16:44                     ` Segher Boessenkool
  2025-09-16 17:33                     ` Indu Bhagat
  0 siblings, 2 replies; 26+ messages in thread
From: Fangrui Song @ 2025-09-16 16:32 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Fangrui Song, Steven Rostedt, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Tue, Sep 16, 2025 at 9:03 AM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> On 9/15/25 11:05 PM, Fangrui Song wrote:
> > On Mon, Sep 15, 2025 at 9:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >> On Sun, 14 Sep 2025 22:42:46 -0700
> >> Indu Bhagat <indu.bhagat@oracle.com> wrote:
> >>
> >>> In such cases, the routines reading the SFrame data under consideration
> >>> here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory
> >>> will need to use a memcpy to copy out the data to an aligned location.
> >>>
> >>> In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame
> >>> FRE decoding" routine could be provided in a arch-specific manner in
> >>> SFrame stack tracers.
> >>
> >> I'm perfectly fine with making it a requirement for the reader of the
> >> SFrame section having to use memcpy into an aligned structure for reading
> >> if the architecture requires it. Let only the architectures that have
> >> issues with unaligned access take the performance hit.
> >>
> >> -- Steve
> >
> > I agree. Unaligned access has nearly zero performance impact on modern
> > architectures, provided the access doesn't span additional cache
> > lines.
> > The padding required for alignment would increase the size, likely
> > creating more overhead than any alignment benefit would justify.
> >
> > (
> >  From a linker and binary utilities perspective, I'd even suggest
> > adopting a universal little-endian format regardless of the target
> > system's native endianness.
> > This would eliminate the need for endianness templates in the C++ code
> > and simplify toolchain implementation across platforms.
> >
>
> (Perhaps I am missing something) Wouldnt a toolchain implementation need
> endianness handling anyway to support cross toolchains?
>
> > On the big-endian z/Architecture, this is efficient: the LOAD REVERSED
> > instructions are used by the bswap versions in the following program,
> > not even requiring extra instructions.
> > #define WIDTH(x) \
> > typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
> > uint##x load_inc##x(uint##x *p) { return *p+1; } \
> > uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
> > uint##x load_eq##x(uint##x *p) { return *p==3; } \
> > uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \
> >
> > WIDTH(16);
> > WIDTH(32);
> > WIDTH(64);
> > )
>
> For AArch64 which SFrame supports too, this is not true. AArch64 has
> both LE and BE.

While runtime consumers typically handle a single endianness, other
tools like linkers and binary utilities must support both. They have
to support cross compilation, producing a big-endian executable from a
little-endian host.

A universal little-endian approach simplifies code. Instead of using a
function like read32le(config, p), where config->endian specifies the
object file's endianness, or read32(p) with an internal endianness
check, the code can simply use read32le(p).

The read32le(p) function is either a standard read or a byte-swapped
read. This byte-swapping is fast on aarch64be (thanks to REV16 and
REV32 instructions) and s390x (byte-swap load).
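
A minimal sketch of the difference between the two styles, for
illustration only: `struct config`, its `big_endian` field and the
function bodies are hypothetical and are not taken from any existing
linker.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical context carrying the object file's byte order.  */
struct config { bool big_endian; };

/* Fixed little-endian read: no context needed.  */
static inline uint32_t
read32le (const uint8_t *p)
{
  return (uint32_t) p[0] | (uint32_t) p[1] << 8
         | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
}

/* Fixed big-endian read.  */
static inline uint32_t
read32be (const uint8_t *p)
{
  return (uint32_t) p[3] | (uint32_t) p[2] << 8
         | (uint32_t) p[1] << 16 | (uint32_t) p[0] << 24;
}

/* Format follows the target's endianness: every call site has to
   pass (or otherwise find) the context.  */
static inline uint32_t
read32 (const struct config *cfg, const uint8_t *p)
{
  return cfg->big_endian ? read32be (p) : read32le (p);
}
```

With a format that is always little-endian, only read32le (p) would
be needed and the context argument disappears from the readers.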


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16 16:32                   ` Fangrui Song
@ 2025-09-16 16:44                     ` Segher Boessenkool
  2025-09-16 17:05                       ` Fangrui Song
  2025-09-16 17:54                       ` Segher Boessenkool
  2025-09-16 17:33                     ` Indu Bhagat
  1 sibling, 2 replies; 26+ messages in thread
From: Segher Boessenkool @ 2025-09-16 16:44 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Indu Bhagat, Steven Rostedt, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Tue, Sep 16, 2025 at 09:32:30AM -0700, Fangrui Song wrote:
> The read32le(p) function is either a standard read or a byte-swapped
> read.

You should never overcomplicate things by doing byte-swaps.  Instead,
just say what you mean:

u32 read32le(u8 *p)
{
	return p[0] + 0x100*p[1] + 0x10000*p[2] + 0x1000000*p[3];
}

or something like that.  The compiler can optimise such things just
fine!  There is no need to go via extra indirections.


Segher


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16 16:44                     ` Segher Boessenkool
@ 2025-09-16 17:05                       ` Fangrui Song
  2025-09-16 17:54                       ` Segher Boessenkool
  1 sibling, 0 replies; 26+ messages in thread
From: Fangrui Song @ 2025-09-16 17:05 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Fangrui Song, Indu Bhagat, Steven Rostedt, Jan Beulich,
	Rainer Orth, linux-toolchains@vger.kernel.org, Jens Remus,
	Sterling Augustine, Pavel Labath, Andrii Nakryiko, Josh Poimboeuf,
	Serhei Makarov, Binutils

On Tue, Sep 16, 2025 at 9:44 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Tue, Sep 16, 2025 at 09:32:30AM -0700, Fangrui Song wrote:
> > The read32le(p) function is either a standard read or a byte-swapped
> > read.
>
> You should never overcomplicate things by doing byte-swaps.  Instead,
> just say what you mean:
>
> u32 read32le(u8 *p)
> {
>         return p[0] + 0x100*p[1] + 0x10000*p[2] + 0x1000000*p[3];
> }
>
> or something like that.  The compiler can optimise such things just
> fine!  There is no need to go via extra indirections.
>
>
> Segher

I made a typo in my previous message.

> Instead of using a function like read32<del>be</del>(config, p), where config->endian specifies the object file's endianness, or read32(p) with an internal endianness check, the code can simply use read32le(p).

When aarch64be or s390x also use little-endian format, a little-endian
host processing their object files can utilize read32le(p), which the
compiler will optimize to either a standard or byte-swapped read.
I refer to the compiler-generated machine code.

If aarch64be or s390x use a big-endian format, the consumer will need
an extra argument in the read32 function or a global context inside
read32.

I think ELF's design, which emphasizes natural size and alignment
guidelines for its control structures, is outdated.
https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#leb128-among-variable-length-integer-encodings
Fortunately, we appear to have achieved some space savings without
needing to implement variable-length encoding. In case it's useful,
variable-length integer encodings like LEB128, PrefixVarInt, or
SuffixVarInt ( https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#leb128-among-variable-length-integer-encodings
) could potentially help in certain scenarios. However, these
approaches might necessitate additional relocations to support RISC-V
and LoongArch linker relaxation.
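
For readers unfamiliar with LEB128, an unsigned decoder can be
sketched roughly as follows. This is only an illustration of the
encoding (7 payload bits per byte, high bit set on all but the last
byte); the function name and error convention are invented here, and
nothing like it is part of SFrame today.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one unsigned LEB128 value from BUF (at most LEN bytes).
   Returns the number of bytes consumed, or 0 if the input is
   truncated or has more groups than a 64-bit value can hold.  */
static inline size_t
decode_uleb128 (const uint8_t *buf, size_t len, uint64_t *out)
{
  uint64_t result = 0;
  unsigned shift = 0;

  for (size_t i = 0; i < len; i++)
    {
      uint8_t byte = buf[i];

      if (shift >= 64)
        return 0;               /* too many continuation bytes */
      result |= (uint64_t) (byte & 0x7f) << shift;
      if ((byte & 0x80) == 0)   /* high bit clear: last byte */
        {
          *out = result;
          return i + 1;
        }
      shift += 7;
    }
  return 0;                     /* ran out of input */
}
```

Note that each byte depends on the previous one to know whether the
value continues, which is part of why such encodings trade decode
speed for size.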


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16 16:32                   ` Fangrui Song
  2025-09-16 16:44                     ` Segher Boessenkool
@ 2025-09-16 17:33                     ` Indu Bhagat
  1 sibling, 0 replies; 26+ messages in thread
From: Indu Bhagat @ 2025-09-16 17:33 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Steven Rostedt, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On 9/16/25 9:32 AM, Fangrui Song wrote:
> On Tue, Sep 16, 2025 at 9:03 AM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>>
>> On 9/15/25 11:05 PM, Fangrui Song wrote:
>>> On Mon, Sep 15, 2025 at 9:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>>
>>>> On Sun, 14 Sep 2025 22:42:46 -0700
>>>> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>>>>
>>>>> In such cases, the routines reading the SFrame data under consideration
>>>>> here (SFrame FRE start addr, and SFrame FRE stack offsets) from memory
>>>>> will need to use a memcpy to copy out the data to an aligned location.
>>>>>
>>>>> In GNU Binutils libsframe (used by ld), we do the above. Such a "SFrame
>>>>> FRE decoding" routine could be provided in a arch-specific manner in
>>>>> SFrame stack tracers.
>>>>
>>>> I'm perfectly fine with making it a requirement for the reader of the
>>>> SFrame section having to use memcpy into an aligned structure for reading
>>>> if the architecture requires it. Let only the architectures that have
>>>> issues with unaligned access take the performance hit.
>>>>
>>>> -- Steve
>>>
>>> I agree. Unaligned access has nearly zero performance impact on modern
>>> architectures, provided the access doesn't span additional cache
>>> lines.
>>> The padding required for alignment would increase the size, likely
>>> creating more overhead than any alignment benefit would justify.
>>>
>>> (
>>>   From a linker and binary utilities perspective, I'd even suggest
>>> adopting a universal little-endian format regardless of the target
>>> system's native endianness.
>>> This would eliminate the need for endianness templates in the C++ code
>>> and simplify toolchain implementation across platforms.
>>>
>>
>> (Perhaps I am missing something) Wouldnt a toolchain implementation need
>> endianness handling anyway to support cross toolchains?
>>
>>> On the big-endian z/Architecture, this is efficient: the LOAD REVERSED
>>> instructions are used by the bswap versions in the following program,
>>> not even requiring extra instructions.
>>> #define WIDTH(x) \
>>> typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
>>> uint##x load_inc##x(uint##x *p) { return *p+1; } \
>>> uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
>>> uint##x load_eq##x(uint##x *p) { return *p==3; } \
>>> uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \
>>>
>>> WIDTH(16);
>>> WIDTH(32);
>>> WIDTH(64);
>>> )
>>
>> For AArch64 which SFrame supports too, this is not true. AArch64 has
>> both LE and BE.
> 
> While runtime consumers typically handle a single endianness, other
> tools like linkers and binary utilities must support both. They have
> to support cross compilation, producing a big-endian executable from a
> little-endian host.
> 

Right.  Sorry, I am still missing the link between "complexity of 
endianness templates in the C++ code" vs what you say in the next 
paragraph: endian-aware read/write is necessary anyway.

> A universal little-endian approach simplifies code. Instead of using a
> function like read32le(config, p), where config->endian specifies the
> object file's endianness, or read32(p) with an internal endianness
> check, the code can simply use read32le(p).
> 
> The read32le(p) function is either a standard read or a byte-swapped
> read. This byte-swapping is fast on aarch64be (thanks to REV16 and
> REV32 instructions) and s390x (byte-swap load).

The rev* instruction is in the data dependency chain.  This means that 
using little-endian for AArch64 BE then defers the task of endian swap 
on to the stack tracers. E.g., aarch64 (added insn in dependency chain):

load_inc16:
         ldrh    w0, [x0]
         add     w0, w0, 1
         ret
load_bswap_inc16:
         ldrh    w0, [x0]
         rev16   w0, w0
         add     w0, w0, 1
         ret

s390x (same height of dependency chain):

load_inc16:
         lh      %r2,0(%r2)
         ahi     %r2,1
         llghr   %r2,%r2
         br      %r14
load_bswap_inc16:
         lrvh    %r2,0(%r2)
         ahi     %r2,1
         llghr   %r2,%r2
         br      %r14


* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16 16:44                     ` Segher Boessenkool
  2025-09-16 17:05                       ` Fangrui Song
@ 2025-09-16 17:54                       ` Segher Boessenkool
  1 sibling, 0 replies; 26+ messages in thread
From: Segher Boessenkool @ 2025-09-16 17:54 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Indu Bhagat, Steven Rostedt, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Tue, Sep 16, 2025 at 11:44:26AM -0500, Segher Boessenkool wrote:
> On Tue, Sep 16, 2025 at 09:32:30AM -0700, Fangrui Song wrote:
> > The read32le(p) function is either a standard read or a byte-swapped
> > read.
> 
> You should never overcomplicate things by doing byte-swaps.  Instead,
> just say what you mean:
> 
> u32 read32le(u8 *p)
> {
> 	return p[0] + 0x100*p[1] + 0x10000*p[2] + 0x1000000*p[3];
> }
> 
> or something like that.  The compiler can optimise such things just
> fine!  There is no need to go via extra indirections.

The following actually compiles to optimal code, both with -mbig and
with -mlittle:

===
typedef unsigned int u32;
typedef unsigned char u8;

u32 read32le(u8 *p)
{
        return (u32)p[0] | (u32)p[1]<<8 | (u32)p[2]<<16 | (u32)p[3]<<24;
}
===

With -O2 -mbig:
        lwbrx 3,0,3      # 10   [c=8 l=4]  bswapsi2_load
        blr              # 18   [c=4 l=4]  simple_return
(on a BE system), and with -O2 -mlittle:
        lwz 3,0(3)       # 11   [c=8 l=4]  *movsi_internal1/3
        blr              # 19   [c=4 l=4]  simple_return
(I used -mcpu=power10, because a) why not, and b) with an ancient CPU
GCC will take more care not to do misaligned accesses.  Power8 is fine
already; 970 (aka Apple G5) isn't (for the LE accesses on a BE host),
and that is good, because such accesses will frequently trap, so on
average they are quite expensive if done as a single read.)
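
The same byte-wise idiom carries over to other widths and works at any alignment, since it only ever dereferences single bytes. The following is a minimal sketch under that assumption; read16le and read64le are hypothetical companions to read32le and do not appear elsewhere in this thread.

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from p.  Valid at any alignment and
   on a host of either endianness, because only bytes are accessed.  */
uint16_t read16le(const uint8_t *p)
{
    return (uint16_t)((uint16_t)p[0] | (uint16_t)p[1] << 8);
}

/* Read a 64-bit little-endian value from p, in the same style.  Each
   byte is widened before shifting so that no bits are lost.  */
uint64_t read64le(const uint8_t *p)
{
    return (uint64_t)p[0]       | (uint64_t)p[1] << 8
         | (uint64_t)p[2] << 16 | (uint64_t)p[3] << 24
         | (uint64_t)p[4] << 32 | (uint64_t)p[5] << 40
         | (uint64_t)p[6] << 48 | (uint64_t)p[7] << 56;
}
```

For example, given the bytes 01 02 at an odd address, read16le yields 0x0201 regardless of the host. Whether a given compiler collapses these into a single load (or byte-reversed load) depends on the compiler version, target, and options.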


Segher

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16  6:05               ` Fangrui Song
  2025-09-16 15:58                 ` Steven Rostedt
  2025-09-16 16:03                 ` Indu Bhagat
@ 2025-09-17 21:12                 ` Steven Rostedt
  2025-09-17 23:55                   ` Alan Modra
  2 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2025-09-17 21:12 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Indu Bhagat, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Mon, 15 Sep 2025 23:05:09 -0700
Fangrui Song <maskray@sourceware.org> wrote:

> From a linker and binary utilities perspective, I'd even suggest  
> adopting a universal little-endian format regardless of the target
> system's native endianness.
> This would eliminate the need for endianness templates in the C++ code
> and simplify toolchain implementation across platforms.

Thinking about this more, I have some concerns with having the SFrame
section being always in little endian format.

1. Is there precedent for an ELF section to be in a different endian than
   what the ELF file is designated as? If not, I don't think we should be
   adding one.

2. This moves the computation from build/link time to run time. As a kernel
   developer, whenever possible, if we can have longer build times for
   quicker runtime we go ahead and do that.

3. It makes the kernel code a bit more complicated.

Basically, if we decide to have SFrames in little endian, then all big
endian machines will be taking a hit at *every* stack trace! If you are
doing one stack trace a millisecond, that means this hit happens 1000s of
times a second. And that's for reading every item in the SFrame section.

We want the stack traces to be as fast as possible as they will be slowing
down the application that is being profiled. Doing byte swaps will likely
have a noticeable impact.

I would strongly suggest keeping the SFrame values in the endianness of
the machine it will be running on.
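
To make the two options concrete, here is a minimal sketch of what a reader would do per 32-bit field under each. It is not kernel code; the names read32_native and read32_fixed_le are hypothetical, and __BYTE_ORDER__ and __builtin_bswap32 are GCC/Clang-specific.

```c
#include <stdint.h>
#include <string.h>

/* Native-endian read: the data is stored in the byte order of the
   machine it runs on, so this is a plain load on every host.  memcpy
   keeps the access valid when p is unaligned.  */
uint32_t read32_native(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Fixed little-endian read: identical to the native read on a
   little-endian host, but requires a byte swap on a big-endian host
   for every field that is read.  */
uint32_t read32_fixed_le(const void *p)
{
    uint32_t v = read32_native(p);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    return v;
}
```

With native-endian data the swap is done once by the toolchain when cross-building; with a fixed little-endian format it moves into the reader on big-endian hosts. How much that swap costs in practice depends on the target, as the other replies in this thread discuss.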

-- Steve

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-17 21:12                 ` Steven Rostedt
@ 2025-09-17 23:55                   ` Alan Modra
  0 siblings, 0 replies; 26+ messages in thread
From: Alan Modra @ 2025-09-17 23:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Fangrui Song, Indu Bhagat, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Jens Remus, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils

On Wed, Sep 17, 2025 at 05:12:51PM -0400, Steven Rostedt wrote:
> On Mon, 15 Sep 2025 23:05:09 -0700
> Fangrui Song <maskray@sourceware.org> wrote:
> 
> > From a linker and binary utilities perspective, I'd even suggest  
> > adopting a universal little-endian format regardless of the target
> > system's native endianness.
> > This would eliminate the need for endianness templates in the C++ code
> > and simplify toolchain implementation across platforms.
> 
> Thinking about this more, I have some concerns with having the SFrame
> section being always in little endian format.
> 
> 1. Is there precedent for an ELF section to be in a different endian than
>    what the ELF file is designated as? If not, I don't think we should be
>    adding one.

Quoting the ELF gABI:
Byte e_ident[EI_DATA] specifies the encoding of both the data
structures used by object file container and data contained in object
file sections.
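
As a small illustration of that rule, a consumer can decide how to decode section contents from e_ident alone. The sketch below assumes nothing beyond the gABI values for EI_DATA, ELFDATA2LSB and ELFDATA2MSB, which are spelled out locally (with an SK_ prefix) so that it does not depend on <elf.h>; the function name elf_is_big_endian is hypothetical.

```c
/* Values from the ELF gABI, restated here so the sketch is
   self-contained.  The SK_ prefix avoids clashing with <elf.h>.  */
enum {
    SK_EI_DATA     = 5,  /* index of the data-encoding byte in e_ident */
    SK_ELFDATA2LSB = 1,  /* two's complement, little-endian */
    SK_ELFDATA2MSB = 2   /* two's complement, big-endian */
};

/* Return 1 if the file declares big-endian data, 0 if little-endian,
   and -1 if the encoding byte is not recognized.  e_ident must point
   at the 16 identification bytes at the start of the ELF file.  */
int elf_is_big_endian(const unsigned char *e_ident)
{
    switch (e_ident[SK_EI_DATA]) {
    case SK_ELFDATA2LSB: return 0;
    case SK_ELFDATA2MSB: return 1;
    default:             return -1;
    }
}
```

Under the gABI wording quoted above, the result of this one check applies to section data as well as to the ELF headers.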

> 
> 2. This moves the computation from build /link time to run time. As a kernel
>    developer, whenever possible, if we can have longer build times for
>    quicker runtime we go ahead and do that.
> 
> 3. It makes the kernel code a bit more complicated.
> 
> Basically, if we decide to have SFrames in little endian, then all big
> endian machines will be taking a hit at *every* stack trace! If you are
> doing one stack trace a millisecond, that means this hit happens 1000s of
> times a second. And that's for reading every item in the SFrame section.
> 
> We want the stack traces to be fast as possible as they will be slowing
> down the application that is being profiled. Doing byte swaps will likely
> have a noticeable impact.
> 
> I would strongly suggest keeping the SFrame values in the endian of the
> machine it will be running on.

I agree.

-- 
Alan Modra

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Unaligned access trade-offs for SFrame FRE layout
  2025-09-16 15:58                 ` Steven Rostedt
@ 2025-09-18 10:39                   ` Jens Remus
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Remus @ 2025-09-18 10:39 UTC (permalink / raw)
  To: Steven Rostedt, Fangrui Song
  Cc: Indu Bhagat, Jan Beulich, Rainer Orth,
	linux-toolchains@vger.kernel.org, Sterling Augustine,
	Pavel Labath, Andrii Nakryiko, Josh Poimboeuf, Serhei Makarov,
	Binutils, Heiko Carstens, Vasily Gorbik

On 9/16/2025 5:58 PM, Steven Rostedt wrote:
> On Mon, 15 Sep 2025 23:05:09 -0700
> Fangrui Song <maskray@sourceware.org> wrote:
> 
>> From a linker and binary utilities perspective, I'd even suggest  
>> adopting a universal little-endian format regardless of the target
>> system's native endianness.
>> This would eliminate the need for endianness templates in the C++ code
>> and simplify toolchain implementation across platforms.
>>
>> On the big-endian z/Architecture, this is efficient: the LOAD REVERSED
>> instructions are used by the bswap versions in the following program,
>> not even requiring extra instructions.
>> #define WIDTH(x) \
>> typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
>> uint##x load_inc##x(uint##x *p) { return *p+1; } \
>> uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
>> uint##x load_eq##x(uint##x *p) { return *p==3; } \
>> uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \
>>
>> WIDTH(16);
>> WIDTH(32);
>> WIDTH(64);
> 
> I would like to hear the comments from Jens on this, as he's adapting
> SFrames for the s390 which I believe is big-endian.

With a universal little-endian format, endianness bugs in the native
case could occur only on s390 (and other big-endian architectures).

Load Reversed takes longer than a normal Load, as it obviously needs to
reverse the register contents.

Regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com

IBM

IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2025-09-12 17:34 Unaligned access trade-offs for SFrame FRE layout Indu Bhagat
2025-09-12 18:19 ` Segher Boessenkool
2025-09-12 19:18 ` Steven Rostedt
2025-09-13  7:56   ` Indu Bhagat
2025-09-15 16:04     ` Steven Rostedt
     [not found]   ` <CAEG7qUxk_cZYv3X_VM6+ZGaVFAD-7jdPd3xA92xYHUAqyzb2Xw@mail.gmail.com>
2025-09-13  8:01     ` Indu Bhagat
2025-09-14 14:14 ` Jan Beulich
2025-09-14 14:39   ` Rainer Orth
2025-09-14 15:23     ` Jan Beulich
2025-09-14 16:18       ` Rainer Orth
2025-09-14 18:10         ` Jan Beulich
2025-09-15  5:42           ` Indu Bhagat
2025-09-15 16:07             ` Steven Rostedt
2025-09-15 17:22               ` Segher Boessenkool
2025-09-16  6:05               ` Fangrui Song
2025-09-16 15:58                 ` Steven Rostedt
2025-09-18 10:39                   ` Jens Remus
2025-09-16 16:03                 ` Indu Bhagat
2025-09-16 16:32                   ` Fangrui Song
2025-09-16 16:44                     ` Segher Boessenkool
2025-09-16 17:05                       ` Fangrui Song
2025-09-16 17:54                       ` Segher Boessenkool
2025-09-16 17:33                     ` Indu Bhagat
2025-09-17 21:12                 ` Steven Rostedt
2025-09-17 23:55                   ` Alan Modra
2025-09-15  9:08       ` Segher Boessenkool
