On Tue, Jun 17, 2025 at 03:55:00PM +0100, Mark Rutland wrote:
> On Tue, Jun 17, 2025 at 04:44:16PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 17, 2025 at 03:24:01PM +0100, Mark Rutland wrote:

> > Anyway, your conditional length thing is 'fun' and has two solutions:

> >   - the arch can refuse to create per-cpu counters with SIMD samples, or

> >   - 0 pad all 'unobtainable state'.

We currently do a *bit* of the 0 for unobtainable state thing for FFR
when in !FA64 streaming mode, though that's for a whole register.
It's probably also worth pointing out that we've got 16 predicate
registers plus FFR, which is sized like a predicate register.  I don't
think it makes much difference for this discussion, but just in case.
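
For reference, a rough sketch of how the state scales with VL, assuming
the usual layout of 32 Z registers at VL bytes each and 16 P registers
plus FFR at VL/8 bytes each.  The helper names here are made up for
illustration and are not existing kernel macros:

```c
#include <stddef.h>

#define NUM_Z_REGS 32	/* Z0-Z31, one vector length each */
#define NUM_P_REGS 16	/* P0-P15, one bit per vector byte */

/* vl is the vector length in bytes (a multiple of 16). */
static size_t z_regs_bytes(size_t vl)
{
	return NUM_Z_REGS * vl;
}

static size_t p_regs_bytes(size_t vl)
{
	return NUM_P_REGS * (vl / 8);
}

/* FFR is sized like a single predicate register. */
static size_t ffr_bytes(size_t vl)
{
	return vl / 8;
}

/* Hypothetical total for one sample recorded at the given VL. */
static size_t vec_state_bytes(size_t vl)
{
	return z_regs_bytes(vl) + p_regs_bytes(vl) + ffr_bytes(vl);
}
```

With a 128-bit VL (16 bytes) that works out to 546 bytes, and with a
2048-bit VL (256 bytes) to 8736 bytes, so the predicates and FFR are a
small fraction either way.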

> > Same when asking for wider vectors than the hardware supports; eg.
> > asking for 512 wide registers on Intel clients will likely end up in a
> > lot of 0s for the high bits -- seeing how AVX512 is mostly a server
> > thing on Intel.

> Yep, those options may work for us, but we'd need to think harder about
> it. Our approach for ptrace and signals has been to have a header and
> pack at the active vector length, so padding to a max width would be
> different, but maybe it's fine.

> Having another representation feels like a recipe for bugs.

Given that we have a different header format everywhere we expose the
register state, it's *probably* fine if the "header" is just that
userspace selected the VL to record with, but like you say it is
different and therefore concerning.  We have something similar with KVM
where we expose these registers with the maximum VL we configured for
the guest regardless of what vector length the guest has configured for
itself.  It's certainly going to be more fiddly to read and write a
non-native format if you're not running in a higher EL like KVM though.

Another thought is that KVM exposes the vector lengths as virtual
registers.  We could perhaps use a similar approach and write the active
VL out as part of the sample, which does start to look like a header and
is perhaps not too horrifying for the perf abstractions (this being very
much a pick your poison situation)?  Even if the VL used to format the
data that's written out is fixed, I'd expect we'll want to be able to
include enough state to figure out the actual VL along with it.
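
To make that a bit more concrete, something along these lines is what I
have in mind.  This is only a sketch: the struct, its field names and
pack_reg() are invented for illustration and nothing like them exists
in the perf ABI today.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-sample header, not an existing perf structure. */
struct vec_sample_hdr {
	uint16_t active_vl;	/* bytes, VL the task was running with */
	uint16_t record_vl;	/* bytes, fixed VL userspace asked for */
};

/*
 * Write one register into a fixed record_vl sized slot.  The live
 * bytes are copied and whatever is left of the slot is zeroed.  If
 * the active VL is larger than the slot the data is truncated.
 * Returns the number of bytes written, which is always record_vl.
 */
static size_t pack_reg(uint8_t *dst, const uint8_t *src,
		       size_t active_vl, size_t record_vl)
{
	size_t n = active_vl < record_vl ? active_vl : record_vl;

	memcpy(dst, src, n);
	memset(dst + n, 0, record_vl - n);

	return record_vl;
}
```

A consumer would then read active_vl to work out how much of each slot
is real data rather than padding, while the slot size itself never
changes for the lifetime of the event.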

If we do padding I worry a bit about the overhead whenever we have to do
it.  AIUI with x86 the register sizes are constant on a given system, so
userspace can simply not select a register size larger than the hardware
supports if they're concerned about the cost.  On arm64, when a system
has both SVE and SME we expect the two will frequently implement
different vector lengths, so needing to pad would be a much more common
case.  It is also expected that programs will only be in streaming mode
for the minimum amount of time required to do the SME operations they
need.  Given that SME will tend to have the larger VL but be less
frequently used, we'd probably pad more often than not by default, which
doesn't seem ideal.  Having said that, I have a feeling the overhead of
just recording things may be sufficiently high that the additional cost
of doing padding will be basically noise.
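
As a back of the envelope illustration of how much padding that could
be for the Z registers alone (the helper names and the example VLs are
made up, real systems will vary):

```c
#include <stddef.h>

#define NUM_Z_REGS 32

/* Bytes of zero padding for the Z registers in one sample. */
static size_t z_pad_bytes(size_t active_vl, size_t record_vl)
{
	if (active_vl >= record_vl)
		return 0;

	return NUM_Z_REGS * (record_vl - active_vl);
}

/* The same thing as a whole percentage of the recorded Z data. */
static unsigned int z_pad_percent(size_t active_vl, size_t record_vl)
{
	size_t total = NUM_Z_REGS * record_vl;

	if (!total)
		return 0;

	return (unsigned int)(z_pad_bytes(active_vl, record_vl) * 100 / total);
}
```

So with a hypothetical 128-bit SVE VL and a 512-bit SME VL, recording
at the SME VL means 1536 of the 2048 bytes of Z data (75%) in every
non-streaming sample are padding.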

Like you say it needs thought.