From mboxrd@z Thu Jan  1 00:00:00 1970
From: arnd@arndb.de (Arnd Bergmann)
Date: Fri, 04 Mar 2016 15:41:58 +0100
Subject: DWord alignment on ARMv7
In-Reply-To: <20160304134651.GB19428@n2100.arm.linux.org.uk>
References: <56D8BA3F.7050508@pengutronix.de>
 <72900410.JPa1IHPXo5@wuerfel>
 <20160304134651.GB19428@n2100.arm.linux.org.uk>
Message-ID: <2494410.DJhrOHD4yE@wuerfel>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Friday 04 March 2016 13:46:51 Russell King - ARM Linux wrote:
> On Fri, Mar 04, 2016 at 02:30:23PM +0100, Arnd Bergmann wrote:
> > Ah, I thought it only required 32-bit alignment like ldm/stm, but it
> > seems that it won't do that. However, an implementation like
> >
> > unsigned long long get_unaligned_u64(void *p)
> > {
> > 	unsigned long long upper, lower;
> > 	lower = *(unsigned long*)p;
> > 	upper = *(unsigned long*)(p+4);
> >
> > 	return lower | (upper << 32);
> > }
> >
> > does get compiled into
> >
> > 00000000 <get_unaligned_u64>:
> >    0:	e8900003 	ldm	r0, {r0, r1}
> >    4:	e12fff1e 	bx	lr
>
> I think it may be something of a bitch to work around, because the
> compiler is going to do stuff like that behind your back.
>
> The only way around that would be to bypass the compiler by using
> asm(), but then you end up bypassing the instruction scheduling too.
> That may not matter, as the resulting overhead may still be lower.

I think the compiler is correctly optimizing the code according to what
we tell it about the alignment here: the ABI explicitly specifies that
variables are naturally aligned, so whenever we dereference a 32-bit
pointer that is not __packed, it should assume it can safely use any
instruction that requires this alignment. This is essentially what
get_unaligned() is designed to work around, when we have data that is
known to be possibly misaligned.
Simply using the generic implementation from before 0567f5facbd
("asm-generic: allow generic unaligned access if the arch supports it")
solves the problem, as we explicitly tell the compiler about the fact
that data is unaligned and let it do the right thing (bytewise access
on ARMv5 or lower, normal access except ldm and ldrd on later
architectures).

The implementation appears to be suboptimal for cross-endian loads
though, as gcc-5.3 does not use the 'rev' instruction here but falls
back on byte accesses. We can easily fix that by introducing one more
generic implementation for the cross-endian accesses, doing

static __always_inline void put_unaligned_be64(u64 val, void *p)
{
	__put_unaligned_cpu64((u64 __force)cpu_to_be64(val), p);
}

static __always_inline void put_unaligned_be32(u32 val, void *p)
{
	__put_unaligned_cpu32((u32 __force)cpu_to_be32(val), p);
}

static __always_inline void put_unaligned_be16(u16 val, void *p)
{
	__put_unaligned_cpu16((u16 __force)cpu_to_be16(val), p);
}

which is better on ARM than the currently used

static inline void __put_unaligned_le16(u16 val, u8 *p)
{
	*p++ = val;
	*p++ = val >> 8;
}

static inline void __put_unaligned_le32(u32 val, u8 *p)
{
	__put_unaligned_le16(val >> 16, p + 2);
	__put_unaligned_le16(val, p);
}

static inline void __put_unaligned_le64(u64 val, u8 *p)
{
	__put_unaligned_le32(val >> 32, p + 4);
	__put_unaligned_le32(val, p);
}

because it will allow using ldr/str+rev for unaligned cross-endian
accesses, but disallow ldm/stm+rev.

	Arnd