On Tue, Jun 17, 2025 at 03:55:00PM +0100, Mark Rutland wrote:
> On Tue, Jun 17, 2025 at 04:44:16PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 17, 2025 at 03:24:01PM +0100, Mark Rutland wrote:

> > Anyway, your conditional length thing is 'fun' and has two solutions:

> >  - the arch can refuse to create per-cpu counters with SIMD samples, or
> >  - 0 pad all 'unobtainable state'.

We currently do a *bit* of the 0 for unobtainable state thing for FFR
when in !FA64 streaming mode, that's for a whole register though.
Probably also worth pointing out that we've got 16 predicate registers
plus FFR which is sized like a predicate register, I don't think it
makes much difference for this discussion but just in case.

> > Same when asking for wider vectors than the hardware supports; eg.
> > asking for 512 wide registers on Intel clients will likely end up in a
> > lot of 0s for the high bits -- seeing how AVX512 is mostly a server
> > thing on Intel.

> Yep, those options may work for us, but we'd need to think harder about
> it. Our approach for ptrace and signals has been to have a header and
> pack at the active vector length, so padding to a max width would be
> different, but maybe it's fine.

> Having another representation feels like a recipe waiting to happen.

Given that we have a different header format for everywhere we expose
the register state it's *probably* fine if the "header" is that
userspace selected the VL to record with, but like you say it is
different and therefore concerning. We have something similar with KVM
where we expose these registers with the maximum VL we configured for
the guest regardless of what vector length the guest has configured for
itself. It's certainly going to be more fiddly to read and write a
non-native format if you're not running in a higher EL like KVM though.

Another thought is that KVM exposes the vector lengths as virtual
registers; we could perhaps use a similar approach and write the active
VL out as part of the sample, which does start to look like a header and
is perhaps not too horrifying for the perf abstractions (this being very
much a pick your poison situation)? Even if the VL used to format the
data that's written out is fixed I'd expect we'll want to be able to
include enough state to figure out the actual VL along with it.

If we do padding I worry a bit about the overhead whenever we have to do
it. AIUI with x86 the register sizes are constant on a given system so
userspace can simply not select a register size larger than the hardware
if they're concerned about the cost. On arm64 when a system has both SVE
and SME we are expecting they will frequently implement different vector
lengths for each, so needing to pad would be a much more common case. It
is expected that programs will only be in streaming mode for the minimum
amount of time required to do the SME operations they need. Given that
SME will tend to have the larger VL but be less frequently used we'd
probably pad more often than not by default, which doesn't seem ideal.

But having said that I have a feeling the overhead of just recording
things may be sufficiently high that the additional cost of doing
padding will be basically noise. Like you say it needs thought.
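
To make the "pad, but also record the active VL" option a bit more
concrete, here is a rough sketch in plain userspace C. It is not kernel
code and not an existing perf ABI: the struct, its field names and
pack_padded() are all made up for illustration. Each register gets a
fixed slot sized by the VL userspace asked to record with, the live
bytes are copied in, the remainder of the slot is zeroed, and the active
VL goes in a small header so a reader can tell padding from real zeros.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sample header, for illustration only (not a perf ABI). */
struct vec_sample_hdr {
	uint16_t record_vl;	/* bytes per register slot, chosen by userspace */
	uint16_t active_vl;	/* bytes per register that were live when sampled */
};

/*
 * Write a header followed by nregs slots of record_vl bytes each.
 * src holds nregs registers packed back to back at active_vl bytes each.
 * The tail of every slot beyond the live bytes is zero-filled; if
 * active_vl is larger than record_vl the register is truncated.
 *
 * Returns the number of bytes written, or 0 if dst is too small.
 */
static size_t pack_padded(uint8_t *dst, size_t dst_len,
			  const uint8_t *src, unsigned int nregs,
			  uint16_t active_vl, uint16_t record_vl)
{
	struct vec_sample_hdr hdr = {
		.record_vl = record_vl,
		.active_vl = active_vl,
	};
	size_t need = sizeof(hdr) + (size_t)nregs * record_vl;
	size_t copy = active_vl < record_vl ? active_vl : record_vl;
	unsigned int i;

	if (dst_len < need)
		return 0;

	memcpy(dst, &hdr, sizeof(hdr));
	dst += sizeof(hdr);

	for (i = 0; i < nregs; i++) {
		/* Live bytes first, then zero the unobtainable tail. */
		memcpy(dst, src + (size_t)i * active_vl, copy);
		memset(dst + copy, 0, record_vl - copy);
		dst += record_vl;
	}

	return need;
}
```

The per-slot memset() is the padding cost in question: it only does any
work when the active VL is smaller than the recording VL, which is the
non-streaming case on a system where the SME VL is the larger one.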