linux-arm-kernel.lists.infradead.org archive mirror
* DWord alignment on ARMv7
@ 2016-03-03 22:27 Marc Kleine-Budde
  2016-03-03 23:54 ` Will Deacon
  0 siblings, 1 reply; 17+ messages in thread
From: Marc Kleine-Budde @ 2016-03-03 22:27 UTC (permalink / raw)
  To: linux-arm-kernel

Hello, (now sending to linux-arm-kernel)

I'm using btrfs on an ARMv7 and it turns out that the kernel has to
fix up a lot of kernel-originated alignment issues.

See /proc/cpu/alignment (~4h of uptime):
> System: 22304815 (btrfs_get_token_64+0x13c/0x148 [btrfs])

For example, when compiling the kernel on a btrfs volume the counter
increases by 100...1000 per second.

The function shown "btrfs_get_token_64()" is defined here:
> http://lxr.free-electrons.com/source/fs/btrfs/struct-funcs.c#L53
...it already uses get_unaligned_leXX accessors.

Quoting a comment in arch/arm/mm/alignment.c:

         * ARMv6 and later CPUs can perform unaligned accesses for
         * most single load and store instructions up to word size.
         * LDM, STM, LDRD and STRD still need to be handled.

But on 32-bit ARMv7, 64 bits are not word-sized.

Is the exception and fixup overhead negligible? Do we have to introduce
something like HAVE_EFFICIENT_UNALIGNED_64BIT_ACCESS?

regards,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 455 bytes
Desc: OpenPGP digital signature
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20160303/78229449/attachment-0001.sig>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-03 22:27 DWord alignment on ARMv7 Marc Kleine-Budde
@ 2016-03-03 23:54 ` Will Deacon
  2016-03-04  8:01   ` btrfs_get_token_64() alignment problem on ARM (was: Re: DWord alignment on ARMv7) Marc Kleine-Budde
  2016-03-04 10:48   ` DWord alignment on ARMv7 Ard Biesheuvel
  0 siblings, 2 replies; 17+ messages in thread
From: Will Deacon @ 2016-03-03 23:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Marc,

On Thu, Mar 03, 2016 at 11:27:11PM +0100, Marc Kleine-Budde wrote:
> I'm using btrfs on an ARMv7 and it turns out that the kernel has to
> fix up a lot of kernel-originated alignment issues.
> 
> See /proc/cpu/alignment (~4h of uptime):
> > System: 22304815 (btrfs_get_token_64+0x13c/0x148 [btrfs])
> 
> For example, when compiling the kernel on a btrfs volume the counter
> increases by 100...1000 per second.
> 
> The function shown "btrfs_get_token_64()" is defined here:
> > http://lxr.free-electrons.com/source/fs/btrfs/struct-funcs.c#L53
> ...it already uses get_unaligned_leXX accessors.
> 
> Quoting a comment in arch/arm/mm/alignment.c:
> 
>          * ARMv6 and later CPUs can perform unaligned accesses for
>          * most single load and store instructions up to word size.
>          * LDM, STM, LDRD and STRD still need to be handled.
> 
> But on 32-bit ARMv7, 64 bits are not word-sized.
> 
> Is the exception and fixup overhead negligible? Do we have to introduce
> something like HAVE_EFFICIENT_UNALIGNED_64BIT_ACCESS?

Ouch, that trap/emulate is certainly going to have an effect on your
performance. I doubt that HAVE_EFFICIENT_UNALIGNED_ACCESS applies to
types bigger than the native word size on many architectures, so my
hunch is that the btrfs code should be checking BITS_PER_LONG or similar
to establish whether or not to break the access up into word accesses.

A cursory look at the network layer indicates that kind of trick is done
over there.

Will
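
The word-splitting Will suggests can be sketched in plain C (helper names here are hypothetical, not the kernel's): on a 32-bit machine, perform the 64-bit unaligned access as two 32-bit halves, using memcpy so the compiler is free to pick alignment-safe instructions for the target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: split a 64-bit unaligned access into two
 * 32-bit halves. A 4-byte memcpy lets the compiler emit whatever
 * loads/stores are safe at any alignment (single ldr on ARMv6+),
 * never an alignment-faulting ldrd. Halves are combined in
 * little-endian order. */
static uint64_t get_unaligned_u64_split(const void *p)
{
	uint32_t lo, hi;

	memcpy(&lo, p, sizeof(lo));                   /* low word  */
	memcpy(&hi, (const char *)p + 4, sizeof(hi)); /* high word */
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

static void put_unaligned_u64_split(uint64_t val, void *p)
{
	uint32_t lo = (uint32_t)val;
	uint32_t hi = (uint32_t)(val >> 32);

	memcpy(p, &lo, sizeof(lo));
	memcpy((char *)p + 4, &hi, sizeof(hi));
}
```

The two helpers round-trip on any host; whether the on-disk byte order matches depends on the endianness conventions of the caller, which is exactly what the leXX/beXX accessors encode in the kernel.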

^ permalink raw reply	[flat|nested] 17+ messages in thread

* btrfs_get_token_64() alignment problem on ARM (was: Re: DWord alignment on ARMv7)
  2016-03-03 23:54 ` Will Deacon
@ 2016-03-04  8:01   ` Marc Kleine-Budde
  2016-03-04  9:16     ` David Sterba
  2016-03-04 10:48   ` DWord alignment on ARMv7 Ard Biesheuvel
  1 sibling, 1 reply; 17+ messages in thread
From: Marc Kleine-Budde @ 2016-03-04  8:01 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

On 03/04/2016 12:54 AM, Will Deacon wrote:
> Hi Marc,
> 
> On Thu, Mar 03, 2016 at 11:27:11PM +0100, Marc Kleine-Budde wrote:
>> I'm using btrfs on an ARMv7 and it turns out that the kernel has to
>> fix up a lot of kernel-originated alignment issues.
>>
>> See /proc/cpu/alignment (~4h of uptime):
>>> System: 22304815 (btrfs_get_token_64+0x13c/0x148 [btrfs])
>>
>> For example, when compiling the kernel on a btrfs volume the counter
>> increases by 100...1000 per second.
>>
>> The function shown "btrfs_get_token_64()" is defined here:
>>> http://lxr.free-electrons.com/source/fs/btrfs/struct-funcs.c#L53
>> ...it already uses get_unaligned_leXX accessors.
>>
>> Quoting a comment in arch/arm/mm/alignment.c:
>>
>>          * ARMv6 and later CPUs can perform unaligned accesses for
>>          * most single load and store instructions up to word size.
>>          * LDM, STM, LDRD and STRD still need to be handled.
>>
>> But on 32-bit ARMv7, 64 bits are not word-sized.
>>
>> Is the exception and fixup overhead negligible? Do we have to introduce
>> something like HAVE_EFFICIENT_UNALIGNED_64BIT_ACCESS?
> 
> Ouch, that trap/emulate is certainly going to have an effect on your
> performance. I doubt that HAVE_EFFICIENT_UNALIGNED_ACCESS applies to
> types bigger than the native word size on many architectures, so my
> hunch is that the btrfs code should be checking BITS_PER_LONG or similar
> to establish whether or not to break the access up into word accesses.

I've added the btrfs maintainers on Cc.

> A cursory look at the network layer indicates that kind of trick is done
> over there.

I stumbled over this, too.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


^ permalink raw reply	[flat|nested] 17+ messages in thread

* btrfs_get_token_64() alignment problem on ARM (was: Re: DWord alignment on ARMv7)
  2016-03-04  8:01   ` btrfs_get_token_64() alignment problem on ARM (was: Re: DWord alignment on ARMv7) Marc Kleine-Budde
@ 2016-03-04  9:16     ` David Sterba
  0 siblings, 0 replies; 17+ messages in thread
From: David Sterba @ 2016-03-04  9:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 04, 2016 at 09:01:44AM +0100, Marc Kleine-Budde wrote:
> Hello,
> 
> On 03/04/2016 12:54 AM, Will Deacon wrote:
> > Hi Marc,
> > 
> > On Thu, Mar 03, 2016 at 11:27:11PM +0100, Marc Kleine-Budde wrote:
> >> I'm using btrfs on an ARMv7 and it turns out that the kernel has to
> >> fix up a lot of kernel-originated alignment issues.
> >>
> >> See /proc/cpu/alignment (~4h of uptime):
> >>> System: 22304815 (btrfs_get_token_64+0x13c/0x148 [btrfs])
> >>
> >> For example, when compiling the kernel on a btrfs volume the counter
> >> increases by 100...1000 per second.
> >>
> >> The function shown "btrfs_get_token_64()" is defined here:
> >>> http://lxr.free-electrons.com/source/fs/btrfs/struct-funcs.c#L53
> >> ...it already uses get_unaligned_leXX accessors.
> >>
> >> Quoting a comment in arch/arm/mm/alignment.c:
> >>
> >>          * ARMv6 and later CPUs can perform unaligned accesses for
> >>          * most single load and store instructions up to word size.
> >>          * LDM, STM, LDRD and STRD still need to be handled.
> >>
> >> But on 32-bit ARMv7, 64 bits are not word-sized.
> >>
> >> Is the exception and fixup overhead negligible? Do we have to introduce
> >> something like HAVE_EFFICIENT_UNALIGNED_64BIT_ACCESS?
> > 
> > Ouch, that trap/emulate is certainly going to have an effect on your
> > performance. I doubt that HAVE_EFFICIENT_UNALIGNED_ACCESS applies to
> > types bigger than the native word size on many architectures, so my
> > hunch is that the btrfs code should be checking BITS_PER_LONG or similar
> > to establish whether or not to break the access up into word accesses.
> 
> I've added the btrfs maintainers on Cc.

Can this be done transparently via the get_unaligned_le* helpers?
This seems to be too arch-specific to fix in btrfs.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-03 23:54 ` Will Deacon
  2016-03-04  8:01   ` btrfs_get_token_64() alignment problem on ARM (was: Re: DWord alignment on ARMv7) Marc Kleine-Budde
@ 2016-03-04 10:48   ` Ard Biesheuvel
  2016-03-04 11:02     ` Russell King - ARM Linux
  1 sibling, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2016-03-04 10:48 UTC (permalink / raw)
  To: linux-arm-kernel

(+ Russell, Arnd)

On 4 March 2016 at 00:54, Will Deacon <will.deacon@arm.com> wrote:
> Hi Marc,
>
> On Thu, Mar 03, 2016 at 11:27:11PM +0100, Marc Kleine-Budde wrote:
>> I'm using btrfs on an ARMv7 and it turns out that the kernel has to
>> fix up a lot of kernel-originated alignment issues.
>>
>> See /proc/cpu/alignment (~4h of uptime):
>> > System: 22304815 (btrfs_get_token_64+0x13c/0x148 [btrfs])
>>
>> For example, when compiling the kernel on a btrfs volume the counter
>> increases by 100...1000 per second.
>>
>> The function shown "btrfs_get_token_64()" is defined here:
>> > http://lxr.free-electrons.com/source/fs/btrfs/struct-funcs.c#L53
>> ...it already uses get_unaligned_leXX accessors.
>>
>> Quoting a comment in arch/arm/mm/alignment.c:
>>
>>          * ARMv6 and later CPUs can perform unaligned accesses for
>>          * most single load and store instructions up to word size.
>>          * LDM, STM, LDRD and STRD still need to be handled.
>>
>> But on 32-bit ARMv7, 64 bits are not word-sized.
>>
>> Is the exception and fixup overhead negligible? Do we have to introduce
>> something like HAVE_EFFICIENT_UNALIGNED_64BIT_ACCESS?
>
> Ouch, that trap/emulate is certainly going to have an effect on your
> performance. I doubt that HAVE_EFFICIENT_UNALIGNED_ACCESS applies to
> types bigger than the native word size on many architectures, so my
> hunch is that the btrfs code should be checking BITS_PER_LONG or similar
> to establish whether or not to break the access up into word accesses.
>
> A cursory look at the network layer indicates that kind of trick is done
> over there.
>

I don't think it is the job of the filesystem driver to reason about
whether get_unaligned_le64() does the right thing under any particular
configuration. If ARM's implementation of get_unaligned_le64() issues
load instructions that result in a trap, it is misbehaving and should
be fixed.

I think that something like HAVE_EFFICIENT_UNALIGNED_64BIT_ACCESS
makes sense. Or, based on Will's comment regarding other
architectures, we could change the generic get_unaligned_le64()
implementation for HAVE_EFFICIENT_UNALIGNED_ACCESS to split accesses
when running on 32-bit archs.

-- 
Ard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-04 10:48   ` DWord alignment on ARMv7 Ard Biesheuvel
@ 2016-03-04 11:02     ` Russell King - ARM Linux
  2016-03-04 11:14       ` Ard Biesheuvel
  0 siblings, 1 reply; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-03-04 11:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 04, 2016 at 11:48:10AM +0100, Ard Biesheuvel wrote:
> I don't think it is the job of the filesystem driver to reason about
> whether get_unaligned_le64() does the right thing under any particular
> configuration. If ARM's implementation of get_unaligned_le64() issues
> load instructions that result in a trap, it is misbehaving and should
> be fixed.

It's not ARM's implementation; we don't have our own implementation, but
we seem to (today) use the asm-generic stuff, which is sub-optimal.

Looking at the state of that, I guess we need to implement our own
asm/unaligned.h - and as the asm-generic stuff assumes that all
access sizes fall into the same categories, I'm guessing we'll need
to implement _all_ accessors ourselves.

That really sounds very sub-optimal, but I don't see any other solution
which wouldn't make the asm-generic stuff even more painful to follow
through multiple include files than it already is today.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-04 11:02     ` Russell King - ARM Linux
@ 2016-03-04 11:14       ` Ard Biesheuvel
  2016-03-04 11:19         ` Marc Kleine-Budde
  2016-03-04 11:38         ` Arnd Bergmann
  0 siblings, 2 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2016-03-04 11:14 UTC (permalink / raw)
  To: linux-arm-kernel

On 4 March 2016 at 12:02, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Mar 04, 2016 at 11:48:10AM +0100, Ard Biesheuvel wrote:
>> I don't think it is the job of the filesystem driver to reason about
>> whether get_unaligned_le64() does the right thing under any particular
>> configuration. If ARM's implementation of get_unaligned_le64() issues
>> load instructions that result in a trap, it is misbehaving and should
>> be fixed.
>
> It's not ARM's implementation; we don't have our own implementation, but
> we seem to (today) use the asm-generic stuff, which is sub-optimal.
>

Indeed. I was not suggesting that ARM carries broken code, simply that
btrfs should not have to worry that it gets built on a platform that
requires extra care when invoking get_unaligned_le64().

> Looking at the state of that, I guess we need to implement our own
> asm/unaligned.h - and as the asm-generic stuff assumes that all
> access sizes fall into the same categories, I'm guessing we'll need
> to implement _all_ accessors ourselves.
>
> That really sounds very sub-optimal, but I don't see any other solution
> which wouldn't make the asm-generic stuff even more painful to follow
> through multiple include files than it already is today.
>

I wonder if we should simply apply something like the patch below
(untested): it depends on how many 32-bit architectures implement
double word load instructions, but for ones that don't, the patch
shouldn't change anything, nor should it change anything for 64-bit
architectures.



-------8<-----------
diff --git a/include/linux/unaligned/access_ok.h
b/include/linux/unaligned/access_ok.h
index 99c1b4d20b0f..019d0b7ea6a3 100644
--- a/include/linux/unaligned/access_ok.h
+++ b/include/linux/unaligned/access_ok.h
@@ -16,7 +16,11 @@ static inline u32 get_unaligned_le32(const void *p)

 static inline u64 get_unaligned_le64(const void *p)
 {
-       return le64_to_cpup((__le64 *)p);
+       if (BITS_PER_LONG == 64)
+               return le64_to_cpup((__le64 *)p);
+       else
+               return ((u64)le32_to_cpup((__le32 *)p)) |
+                      ((u64)le32_to_cpup((__le32 *)p + 1) << 32);
 }

 static inline u16 get_unaligned_be16(const void *p)
@@ -31,7 +35,11 @@ static inline u32 get_unaligned_be32(const void *p)

 static inline u64 get_unaligned_be64(const void *p)
 {
-       return be64_to_cpup((__be64 *)p);
+       if (BITS_PER_LONG == 64)
+               return be64_to_cpup((__be64 *)p);
+       else
+               return ((u64)be32_to_cpup((__be32 *)p) << 32) |
+                      ((u64)be32_to_cpup((__be32 *)p + 1));
 }

 static inline void put_unaligned_le16(u16 val, void *p)
@@ -46,7 +54,12 @@ static inline void put_unaligned_le32(u32 val, void *p)

 static inline void put_unaligned_le64(u64 val, void *p)
 {
-       *((__le64 *)p) = cpu_to_le64(val);
+       if (BITS_PER_LONG == 64) {
+               *((__le64 *)p) = cpu_to_le64(val);
+       } else {
+               *((__le32 *)p) = cpu_to_le32(val);
+               *((__le32 *)p + 1) = cpu_to_le32(val >> 32);
+       }
 }

 static inline void put_unaligned_be16(u16 val, void *p)
@@ -61,7 +74,12 @@ static inline void put_unaligned_be32(u32 val, void *p)

 static inline void put_unaligned_be64(u64 val, void *p)
 {
-       *((__be64 *)p) = cpu_to_be64(val);
+       if (BITS_PER_LONG == 64) {
+               *((__be64 *)p) = cpu_to_be64(val);
+       } else {
+               *((__be32 *)p) = cpu_to_le32(val >> 32);
+               *((__be32 *)p + 1) = cpu_to_le32(val);
+       }
 }

 #endif /* _LINUX_UNALIGNED_ACCESS_OK_H */
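
Outside the kernel, the little-endian half-combining in the getter above can be exercised with a byte-shift sketch (these helper names are hypothetical stand-ins for the le32_to_cpup()-style accessors):

```c
#include <stdint.h>

/* Assemble a little-endian 32-bit value byte by byte: no load wider
 * than one byte is ever issued, so this is safe at any alignment and
 * independent of host endianness. */
static uint32_t le32_load(const uint8_t *b)
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Mirrors the 32-bit branch of get_unaligned_le64() in the patch:
 * low 32 bits from the first four bytes, high 32 bits shifted up. */
static uint64_t le64_load_split(const uint8_t *b)
{
	return (uint64_t)le32_load(b) | ((uint64_t)le32_load(b + 4) << 32);
}
```

For the bytes 88 77 66 55 44 33 22 11 this yields 0x1122334455667788, matching what a native little-endian 64-bit load would produce.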

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-04 11:14       ` Ard Biesheuvel
@ 2016-03-04 11:19         ` Marc Kleine-Budde
  2016-03-04 11:26           ` Ard Biesheuvel
  2016-03-04 11:38         ` Arnd Bergmann
  1 sibling, 1 reply; 17+ messages in thread
From: Marc Kleine-Budde @ 2016-03-04 11:19 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/04/2016 12:14 PM, Ard Biesheuvel wrote:
> I wonder if we should simply apply something like the patch below
> (untested): it depends on how many 32-bit architectures implement

The "Last line effect" hit here.

> double word load instructions, but for ones that don't, the patch
> shouldn't change anything, nor should it change anything for 64-bit
> architectures.

> -------8<-----------
> diff --git a/include/linux/unaligned/access_ok.h
> b/include/linux/unaligned/access_ok.h
> index 99c1b4d20b0f..019d0b7ea6a3 100644
> --- a/include/linux/unaligned/access_ok.h
> +++ b/include/linux/unaligned/access_ok.h
> @@ -16,7 +16,11 @@ static inline u32 get_unaligned_le32(const void *p)
> 
>  static inline u64 get_unaligned_le64(const void *p)
>  {
> -       return le64_to_cpup((__le64 *)p);
> +       if (BITS_PER_LONG == 64)
> +               return le64_to_cpup((__le64 *)p);
> +       else
> +               return ((u64)le32_to_cpup((__le32 *)p)) |
> +                      ((u64)le32_to_cpup((__le32 *)p + 1) << 32);
>  }
> 
>  static inline u16 get_unaligned_be16(const void *p)
> @@ -31,7 +35,11 @@ static inline u32 get_unaligned_be32(const void *p)
> 
>  static inline u64 get_unaligned_be64(const void *p)
>  {
> -       return be64_to_cpup((__be64 *)p);
> +       if (BITS_PER_LONG == 64)
> +               return be64_to_cpup((__be64 *)p);
> +       else
> +               return ((u64)be32_to_cpup((__be32 *)p) << 32) |
> +                      ((u64)be32_to_cpup((__be32 *)p + 1));
>  }
> 
>  static inline void put_unaligned_le16(u16 val, void *p)
> @@ -46,7 +54,12 @@ static inline void put_unaligned_le32(u32 val, void *p)
> 
>  static inline void put_unaligned_le64(u64 val, void *p)
>  {
> -       *((__le64 *)p) = cpu_to_le64(val);
> +       if (BITS_PER_LONG == 64) {
> +               *((__le64 *)p) = cpu_to_le64(val);
> +       } else {
> +               *((__le32 *)p) = cpu_to_le32(val);
> +               *((__le32 *)p + 1) = cpu_to_le32(val >> 32);
> +       }
>  }
> 
>  static inline void put_unaligned_be16(u16 val, void *p)
> @@ -61,7 +74,12 @@ static inline void put_unaligned_be32(u32 val, void *p)
> 
>  static inline void put_unaligned_be64(u64 val, void *p)
>  {
> -       *((__be64 *)p) = cpu_to_be64(val);
> +       if (BITS_PER_LONG == 64) {
> +               *((__be64 *)p) = cpu_to_be64(val);
> +       } else {
> +               *((__be32 *)p) = cpu_to_le32(val >> 32);
                                          be32
> +               *((__be32 *)p + 1) = cpu_to_le32(val);
                                              be32

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


^ permalink raw reply	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-04 11:19         ` Marc Kleine-Budde
@ 2016-03-04 11:26           ` Ard Biesheuvel
  0 siblings, 0 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2016-03-04 11:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 4 March 2016 at 12:19, Marc Kleine-Budde <mkl@pengutronix.de> wrote:
> On 03/04/2016 12:14 PM, Ard Biesheuvel wrote:
>> I wonder if we should simply apply something like the patch below
>> (untested): it depends on how many 32-bit architectures implement
>
> The "Last line effect" hit here.
>
>> double word load instructions, but for ones that don't, the patch
>> shouldn't change anything, nor should it change anything for 64-bit
>> architectures.
>
>> -------8<-----------
>> diff --git a/include/linux/unaligned/access_ok.h
>> b/include/linux/unaligned/access_ok.h
>> index 99c1b4d20b0f..019d0b7ea6a3 100644
>> --- a/include/linux/unaligned/access_ok.h
>> +++ b/include/linux/unaligned/access_ok.h
>> @@ -16,7 +16,11 @@ static inline u32 get_unaligned_le32(const void *p)
>>
>>  static inline u64 get_unaligned_le64(const void *p)
>>  {
>> -       return le64_to_cpup((__le64 *)p);
>> +       if (BITS_PER_LONG == 64)
>> +               return le64_to_cpup((__le64 *)p);
>> +       else
>> +               return ((u64)le32_to_cpup((__le32 *)p)) |
>> +                      ((u64)le32_to_cpup((__le32 *)p + 1) << 32);
>>  }
>>
>>  static inline u16 get_unaligned_be16(const void *p)
>> @@ -31,7 +35,11 @@ static inline u32 get_unaligned_be32(const void *p)
>>
>>  static inline u64 get_unaligned_be64(const void *p)
>>  {
>> -       return be64_to_cpup((__be64 *)p);
>> +       if (BITS_PER_LONG == 64)
>> +               return be64_to_cpup((__be64 *)p);
>> +       else
>> +               return ((u64)be32_to_cpup((__be32 *)p) << 32) |
>> +                      ((u64)be32_to_cpup((__be32 *)p + 1));
>>  }
>>
>>  static inline void put_unaligned_le16(u16 val, void *p)
>> @@ -46,7 +54,12 @@ static inline void put_unaligned_le32(u32 val, void *p)
>>
>>  static inline void put_unaligned_le64(u64 val, void *p)
>>  {
>> -       *((__le64 *)p) = cpu_to_le64(val);
>> +       if (BITS_PER_LONG == 64) {
>> +               *((__le64 *)p) = cpu_to_le64(val);
>> +       } else {
>> +               *((__le32 *)p) = cpu_to_le32(val);
>> +               *((__le32 *)p + 1) = cpu_to_le32(val >> 32);
>> +       }
>>  }
>>
>>  static inline void put_unaligned_be16(u16 val, void *p)
>> @@ -61,7 +74,12 @@ static inline void put_unaligned_be32(u32 val, void *p)
>>
>>  static inline void put_unaligned_be64(u64 val, void *p)
>>  {
>> -       *((__be64 *)p) = cpu_to_be64(val);
>> +       if (BITS_PER_LONG == 64) {
>> +               *((__be64 *)p) = cpu_to_be64(val);
>> +       } else {
>> +               *((__be32 *)p) = cpu_to_le32(val >> 32);
>                                           be32
>> +               *((__be32 *)p + 1) = cpu_to_le32(val);
>                                               be32
>

Ah yes, thanks for spotting that.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-04 11:14       ` Ard Biesheuvel
  2016-03-04 11:19         ` Marc Kleine-Budde
@ 2016-03-04 11:38         ` Arnd Bergmann
  2016-03-04 11:44           ` Ard Biesheuvel
  1 sibling, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2016-03-04 11:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 04 March 2016 12:14:24 Ard Biesheuvel wrote:
> On 4 March 2016 at 12:02, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 04, 2016 at 11:48:10AM +0100, Ard Biesheuvel wrote:
> >> I don't think it is the job of the filesystem driver to reason about
> >> whether get_unaligned_le64() does the right thing under any particular
> >> configuration. If ARM's implementation of get_unaligned_le64() issues
> >> load instructions that result in a trap, it is misbehaving and should
> >> be fixed.
> >
> > It's not ARM's implementation; we don't have our own implementation, but
> > we seem to (today) use the asm-generic stuff, which is sub-optimal.
> >
> 
> Indeed. I was not suggesting that ARM carries broken code, simply that
> btrfs should not have to worry that it gets built on a platform that
> requires extra care when invoking get_unaligned_le64()
> 
> > Looking at the state of that, I guess we need to implement our own
> > asm/unaligned.h - and as the asm-generic stuff assumes that all
> > access sizes fall into the same categories, I'm guessing we'll need
> > to implement _all_ accessors ourselves.
> >
> > That really sounds very sub-optimal, but I don't see any other solution
> > which wouldn't make the asm-generic stuff even more painful to follow
> > through multiple include files than it already is today.
> >
> 
> I wonder if we should simply apply something like the patch below
> (untested): it depends on how many 32-bit architectures implement
> double word load instructions, but for ones that don't, the patch
> shouldn't change anything, nor should it change anything for 64-bit
> architectures.

I think it's wrong to change the common code; it's unlikely that
any other architectures have this specific requirement.

Here is a patch I've come up with independently. I have verified
that it removes all ldrd/strd from the btrfs unaligned data
handling.

The open question about it is whether we'd rather play safe and
let the compiler handle unaligned accesses itself, removing the
theoretical risk of the compiler optimizing

	void *p;
	u64 v = get_unaligned((u32 *)p) + ((u64)get_unaligned((u32 *)(p + 4)) << 32);

into an ldrd. I think the linux/unaligned/access_ok.h implementation
would allow that.

	Arnd

commit abf1d8ecc9d88c16dbb72ec902eee79fe5edd2dc
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Mar 4 12:28:09 2016 +0100

    ARM: avoid ldrd/strd for get/put_unaligned
    
    The include/linux/unaligned/access_ok.h header tells the compiler
    that it can treat all pointers as aligned. This is not true
    on ARM, as 64-bit variables are typically accessed using ldrd/strd,
    which require 32-bit alignment.
    
    This introduces an architecture specific asm/unaligned.h header
    that uses the normal "everything is fine" implementation for 16-bit
    and 32-bit accesses, but uses the struct-based implementation for
    64-bit accesses, which the compiler is smart enough to handle using
    multiple 32-bit accesses.
    
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 55e0e3ea9cb6..bd12b98e2589 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -37,4 +37,3 @@ generic-y += termbits.h
 generic-y += termios.h
 generic-y += timex.h
 generic-y += trace_clock.h
-generic-y += unaligned.h
diff --git a/arch/arm/include/asm/unaligned.h b/arch/arm/include/asm/unaligned.h
new file mode 100644
index 000000000000..4cb7b641778a
--- /dev/null
+++ b/arch/arm/include/asm/unaligned.h
@@ -0,0 +1,108 @@
+#ifndef _ARM_ASM_UNALIGNED_H
+#define _ARM_ASM_UNALIGNED_H
+
+/*
+ * This is the most generic implementation of unaligned accesses
+ * and should work almost anywhere.
+ */
+#include <asm/byteorder.h>
+#include <linux/kernel.h>
+#include <asm/byteorder.h>
+
+/* Set by the arch if it can handle unaligned accesses in hardware. */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+#include <linux/unaligned/packed_struct.h>
+
+/*
+ * These are copied from include/linux/unaligned/access_ok.h:
+ * we pretend everything is fully aligned and let the compiler
+ * always issue normal loads/stores. This isn't entirely correct
+ * as we effectively cast a pointer from smaller alignment
+ * to larger alignment, but it works fine until gcc gets smart
+ * enough to merge multiple consecutive get_unaligned() calls
+ * into a single ldm/stm or ldrd/strd instruction.
+ */
+static __always_inline u16 get_unaligned_le16(const void *p)
+{
+	return le16_to_cpup((__le16 *)p);
+}
+
+static __always_inline u32 get_unaligned_le32(const void *p)
+{
+	return le32_to_cpup((__le32 *)p);
+}
+
+static __always_inline u16 get_unaligned_be16(const void *p)
+{
+	return be16_to_cpup((__be16 *)p);
+}
+
+static __always_inline u32 get_unaligned_be32(const void *p)
+{
+	return be32_to_cpup((__be32 *)p);
+}
+
+static __always_inline void put_unaligned_le16(u16 val, void *p)
+{
+	*((__le16 *)p) = cpu_to_le16(val);
+}
+
+static __always_inline void put_unaligned_le32(u32 val, void *p)
+{
+	*((__le32 *)p) = cpu_to_le32(val);
+}
+
+static __always_inline void put_unaligned_be16(u16 val, void *p)
+{
+	*((__be16 *)p) = cpu_to_be16(val);
+}
+
+static __always_inline void put_unaligned_be32(u32 val, void *p)
+{
+	*((__be32 *)p) = cpu_to_be32(val);
+}
+
+/*
+ * These use the packed_struct implementation to split up a
+ * potentially unaligned ldrd/strd into two 32-bit accesses
+ */
+static __always_inline u64 get_unaligned_le64(const void *p)
+{
+	return le64_to_cpu((__le64 __force)__get_unaligned_cpu64(p));
+}
+
+static __always_inline u64 get_unaligned_be64(const void *p)
+{
+	return be64_to_cpu((__be64 __force)__get_unaligned_cpu64(p));
+}
+
+static __always_inline void put_unaligned_le64(u64 val, void *p)
+{
+	__put_unaligned_cpu64((u64 __force)cpu_to_le64(val), p);
+}
+
+static __always_inline void put_unaligned_be64(u64 val, void *p)
+{
+	__put_unaligned_cpu64((u64 __force)cpu_to_be64(val), p);
+}
+#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+#if defined(__LITTLE_ENDIAN)
+# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+#  include <linux/unaligned/le_struct.h>
+#  include <linux/unaligned/be_byteshift.h>
+# endif
+# include <linux/unaligned/generic.h>
+# define get_unaligned	__get_unaligned_le
+# define put_unaligned	__put_unaligned_le
+#elif defined(__BIG_ENDIAN)
+# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+#  include <linux/unaligned/be_struct.h>
+#  include <linux/unaligned/le_byteshift.h>
+# endif
+# include <linux/unaligned/generic.h>
+# define get_unaligned	__get_unaligned_be
+# define put_unaligned	__put_unaligned_be
+#endif
+
+#endif /* _ARM_ASM_UNALIGNED_H */
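
The struct-based trick the commit message relies on can be sketched in plain C (GCC/Clang `__attribute__((packed))` assumed; the kernel's __get_unaligned_cpu64() helper works the same way, and the struct name below is made up for illustration):

```c
#include <stdint.h>

/* A packed struct tells the compiler the member may sit at any
 * address, so it must emit alignment-safe accesses (e.g. two 32-bit
 * or four byte accesses on ARM instead of a single ldrd/strd). */
struct u64_unal {
	uint64_t x;
} __attribute__((packed));

static uint64_t load_u64_any_align(const void *p)
{
	return ((const struct u64_unal *)p)->x;
}

static void store_u64_any_align(uint64_t val, void *p)
{
	((struct u64_unal *)p)->x = val;
}
```

Because the access goes through a packed type rather than a plain cast, the compiler cannot legally assume 64-bit alignment, which is exactly the property the access_ok.h implementation gives away.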

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* DWord alignment on ARMv7
  2016-03-04 11:38         ` Arnd Bergmann
@ 2016-03-04 11:44           ` Ard Biesheuvel
  2016-03-04 13:30             ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Ard Biesheuvel @ 2016-03-04 11:44 UTC (permalink / raw)
  To: linux-arm-kernel

On 4 March 2016 at 12:38, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 04 March 2016 12:14:24 Ard Biesheuvel wrote:
>> On 4 March 2016 at 12:02, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > On Fri, Mar 04, 2016 at 11:48:10AM +0100, Ard Biesheuvel wrote:
>> >> I don't think it is the job of the filesystem driver to reason about
>> >> whether get_unaligned_le64() does the right thing under any particular
>> >> configuration. If ARM's implementation of get_unaligned_le64() issues
>> >> load instructions that result in a trap, it is misbehaving and should
>> >> be fixed.
>> >
>> > It's not ARM's implementation; we don't have our own implementation, but
>> > we seem to (today) use the asm-generic stuff, which is sub-optimal.
>> >
>>
>> Indeed. I was not suggesting that ARM carries broken code, simply that
>> btrfs should not have to worry that it gets built on a platform that
>> requires extra care when invoking get_unaligned_le64()
>>
>> > Looking at the state of that, I guess we need to implement our own
>> > asm/unaligned.h - and as the asm-generic stuff assumes that all
>> > access sizes fall into the same categories, I'm guessing we'll need
>> > to implement _all_ accessors ourselves.
>> >
>> > That really sounds very sub-optimal, but I don't see any other solution
>> > which wouldn't make the asm-generic stuff even more painful to follow
>> > through multiple include files than it already is today.
>> >
>>
>> I wonder if we should simply apply something like the patch below
>> (untested): it depends on how many 32-bit architectures implement
>> double word load instructions, but for ones that don't, the patch
>> shouldn't change anything, nor should it change anything for 64-bit
>> architectures.
>
> I think it's wrong to change the common code, it's unlikely that
> any other architectures have this specific requirement.
>

In general, yes. In practice, it depends on whether any 32-bit
architectures have double word loads that can perform arbitrary
unaligned accesses.

> Here is a patch I've come up with independently. I have verified
> that it removes all ldrd/strd from the btrfs unaligned data
> handling.
>
> The open question about it is whether we'd rather play safe and
> let the compiler handle unaligned accesses itself, removing the
> theoretical risk of the compiler optimizing
>
>         void *p;
> >         u64 v = get_unaligned((u32 *)p) + ((u64)get_unaligned((u32 *)(p + 4)) << 32);
>
> into an ldrd. I think the linux/unaligned/access_ok.h implementation
> would allow that.
>

I would assume that the compiler engineers are aware of the alignment
requirement of ldrd/strd, and don't promote adjacent accesses like
that if the pointer may not be 64-bit aligned.

> commit abf1d8ecc9d88c16dbb72ec902eee79fe5edd2dc
> Author: Arnd Bergmann <arnd@arndb.de>
> Date:   Fri Mar 4 12:28:09 2016 +0100
>
>     ARM: avoid ldrd/strd for get/put_unaligned
>
>     The include/linux/unaligned/access_ok.h header tells the compiler
>     that it can pretend all pointers to be aligned. This is not true
>     on ARM, as 64-bit variables are typically accessed using ldrd/strd,
>     which require 32-bit alignment.
>
>     This introduces an architecture specific asm/unaligned.h header
>     that uses the normal "everything is fine" implementation for 16-bit
>     and 32-bit accesses, but uses the struct based implementation for
>     64-bit accesses, which the compiler is smart enough to handle using
>     multiple 32-bit accesses.
>
>     Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
> index 55e0e3ea9cb6..bd12b98e2589 100644
> --- a/arch/arm/include/asm/Kbuild
> +++ b/arch/arm/include/asm/Kbuild
> @@ -37,4 +37,3 @@ generic-y += termbits.h
>  generic-y += termios.h
>  generic-y += timex.h
>  generic-y += trace_clock.h
> -generic-y += unaligned.h
> diff --git a/arch/arm/include/asm/unaligned.h b/arch/arm/include/asm/unaligned.h
> new file mode 100644
> index 000000000000..4cb7b641778a
> --- /dev/null
> +++ b/arch/arm/include/asm/unaligned.h
> @@ -0,0 +1,108 @@
> +#ifndef _ARM_ASM_UNALIGNED_H
> +#define _ARM_ASM_UNALIGNED_H
> +
> +/*
> + * This is the most generic implementation of unaligned accesses
> + * and should work almost anywhere.
> + */
> +#include <asm/byteorder.h>
> +#include <linux/kernel.h>
> +#include <asm/byteorder.h>

Any particular reason to include this twice?

> +
> +/* Set by the arch if it can handle unaligned accesses in hardware. */
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +#include <linux/unaligned/packed_struct.h>
> +
> +/*
> + * These are copied from include/linux/unaligned/access_ok.h:
> + * we pretend everything is fully aligned and let the compiler
> + * always issue normal loads/stores. This isn't entirely correct
> + * as we effectively cast a pointer from smaller alignment
> + * to larger alignment, but it works fine until gcc gets smart
> + * enough to merge multiple consecutive get_unaligned() calls
> + * into a single ldm/stm or ldrd/strd instruction.
> + */
> +static __always_inline u16 get_unaligned_le16(const void *p)
> +{
> +       return le16_to_cpup((__le16 *)p);
> +}
> +
> +static __always_inline u32 get_unaligned_le32(const void *p)
> +{
> +       return le32_to_cpup((__le32 *)p);
> +}
> +
> +static __always_inline u16 get_unaligned_be16(const void *p)
> +{
> +       return be16_to_cpup((__be16 *)p);
> +}
> +
> +static __always_inline u32 get_unaligned_be32(const void *p)
> +{
> +       return be32_to_cpup((__be32 *)p);
> +}
> +
> +static __always_inline void put_unaligned_le16(u16 val, void *p)
> +{
> +       *((__le16 *)p) = cpu_to_le16(val);
> +}
> +
> +static __always_inline void put_unaligned_le32(u32 val, void *p)
> +{
> +       *((__le32 *)p) = cpu_to_le32(val);
> +}
> +
> +static __always_inline void put_unaligned_be16(u16 val, void *p)
> +{
> +       *((__be16 *)p) = cpu_to_be16(val);
> +}
> +
> +static __always_inline void put_unaligned_be32(u32 val, void *p)
> +{
> +       *((__be32 *)p) = cpu_to_be32(val);
> +}
> +
> +/*
> + * These use the packed_struct implementation to split up a
> + * potentially unaligned ldrd/strd into two 32-bit accesses
> + */
> +static __always_inline u64 get_unaligned_le64(const void *p)
> +{
> +       return le64_to_cpu((__le64 __force)__get_unaligned_cpu64(p));
> +}
> +
> +static __always_inline u64 get_unaligned_be64(const void *p)
> +{
> +       return be64_to_cpu((__be64 __force)__get_unaligned_cpu64(p));
> +}
> +
> +static __always_inline void put_unaligned_le64(u64 val, void *p)
> +{
> +       __put_unaligned_cpu64((u64 __force)cpu_to_le64(val), p);
> +}
> +
> +static __always_inline void put_unaligned_be64(u64 val, void *p)
> +{
> +       __put_unaligned_cpu64((u64 __force)cpu_to_be64(val), p);
> +}
> +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
> +
> +#if defined(__LITTLE_ENDIAN)
> +# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +#  include <linux/unaligned/le_struct.h>
> +#  include <linux/unaligned/be_byteshift.h>
> +# endif
> +# include <linux/unaligned/generic.h>
> +# define get_unaligned __get_unaligned_le
> +# define put_unaligned __put_unaligned_le
> +#elif defined(__BIG_ENDIAN)
> +# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +#  include <linux/unaligned/be_struct.h>
> +#  include <linux/unaligned/le_byteshift.h>
> +# endif
> +# include <linux/unaligned/generic.h>
> +# define get_unaligned __get_unaligned_be
> +# define put_unaligned __put_unaligned_be
> +#endif
> +
> +#endif /* _ARM_ASM_UNALIGNED_H */
>


* DWord alignment on ARMv7
  2016-03-04 11:44           ` Ard Biesheuvel
@ 2016-03-04 13:30             ` Arnd Bergmann
  2016-03-04 13:33               ` Ard Biesheuvel
  2016-03-04 13:46               ` Russell King - ARM Linux
  0 siblings, 2 replies; 17+ messages in thread
From: Arnd Bergmann @ 2016-03-04 13:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 04 March 2016 12:44:23 Ard Biesheuvel wrote:
> On 4 March 2016 at 12:38, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Friday 04 March 2016 12:14:24 Ard Biesheuvel wrote:
> >> On 4 March 2016 at 12:02, Russell King - ARM Linux
> > Here is a patch I've come up with independently. I have verified
> > that it removes all ldrd/strd from the btrfs unaligned data
> > handling.
> >
> > The open question about it is whether we'd rather play safe and
> > let the compiler handle unaligned accesses itself, removing the
> > theoretical risk of the compiler optimizing
> >
> >         void *p;
> > >         u64 v = get_unaligned((u32 *)p) + ((u64)get_unaligned((u32 *)(p + 4)) << 32);
> >
> > into an ldrd. I think the linux/unaligned/access_ok.h implementation
> > would allow that.
> >
> 
> I would assume that the compiler engineers are aware of the alignment
> requirement of ldrd/strd, and don't promote adjacent accesses like
> that if the pointer may not be 64-bit aligned.

Ah, I thought it only required 32-bit alignment like ldm/stm, but it
seems that it won't do that. However, an implementation like

unsigned long long get_unaligned_u64(void *p)
{
        unsigned long long upper, lower;
        lower = *(unsigned long*)p;
        upper = *(unsigned long*)(p+4);

        return lower | (upper << 32);
}

does get compiled into

00000000 <f>:
   0:   e8900003        ldm     r0, {r0, r1}
   4:   e12fff1e        bx      lr

which is still wrong, so I assume there is some danger of that remaining
with both of our patches, as the compiler might decide to merge
a series of unaligned 32-bit loads into an ldm, as long as our implementation
incorrectly tells the compiler that the data is 32-bit aligned.

> > + * This is the most generic implementation of unaligned accesses
> > + * and should work almost anywhere.
> > + */
> > +#include <asm/byteorder.h>
> > +#include <linux/kernel.h>
> > +#include <asm/byteorder.h>
> 
> Any particular reason to include this twice?

No, just a mistake when merging the access_ok.h into this file.

	Arnd


* DWord alignment on ARMv7
  2016-03-04 13:30             ` Arnd Bergmann
@ 2016-03-04 13:33               ` Ard Biesheuvel
  2016-03-04 13:46               ` Russell King - ARM Linux
  1 sibling, 0 replies; 17+ messages in thread
From: Ard Biesheuvel @ 2016-03-04 13:33 UTC (permalink / raw)
  To: linux-arm-kernel

On 4 March 2016 at 14:30, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 04 March 2016 12:44:23 Ard Biesheuvel wrote:
>> On 4 March 2016 at 12:38, Arnd Bergmann <arnd@arndb.de> wrote:
>> > On Friday 04 March 2016 12:14:24 Ard Biesheuvel wrote:
>> >> On 4 March 2016 at 12:02, Russell King - ARM Linux
>> > Here is a patch I've come up with independently. I have verified
>> > that it removes all ldrd/strd from the btrfs unaligned data
>> > handling.
>> >
>> > The open question about it is whether we'd rather play safe and
>> > let the compiler handle unaligned accesses itself, removing the
>> > theoretical risk of the compiler optimizing
>> >
>> >         void *p;
>> >         u64 v = get_unaligned((u32 *)p) + ((u64)get_unaligned((u32 *)(p + 4)) << 32);
>> >
>> > into an ldrd. I think the linux/unaligned/access_ok.h implementation
>> > would allow that.
>> >
>>
>> I would assume that the compiler engineers are aware of the alignment
>> requirement of ldrd/strd, and don't promote adjacent accesses like
>> that if the pointer may not be 64-bit aligned.
>
> Ah, I thought it only required 32-bit alignment like ldm/stm, but it
> seems that it won't do that. However, an implementation like
>

It does only require 32-bit alignment, I was just confused.

> unsigned long long get_unaligned_u64(void *p)
> {
>         unsigned long long upper, lower;
>         lower = *(unsigned long*)p;
>         upper = *(unsigned long*)(p+4);
>
>         return lower | (upper << 32);
> }
>
> does get compiled into
>
> 00000000 <f>:
>    0:   e8900003        ldm     r0, {r0, r1}
>    4:   e12fff1e        bx      lr
>
> which is still wrong, so I assume there is some danger of that remaining
> with both of our patches, as the compiler might  decide to merge
> a series of unaligned 32-bit loads into an ldm, as long as our implementation
> incorrectly tells the compiler that the data is 32-bit aligned.
>
>> > + * This is the most generic implementation of unaligned accesses
>> > + * and should work almost anywhere.
>> > + */
>> > +#include <asm/byteorder.h>
>> > +#include <linux/kernel.h>
>> > +#include <asm/byteorder.h>
>>
>> Any particular reason to include this twice?
>
> No, just a mistake when merging the access_ok.h into this file.
>
>         Arnd


* DWord alignment on ARMv7
  2016-03-04 13:30             ` Arnd Bergmann
  2016-03-04 13:33               ` Ard Biesheuvel
@ 2016-03-04 13:46               ` Russell King - ARM Linux
  2016-03-04 14:41                 ` Arnd Bergmann
  1 sibling, 1 reply; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-03-04 13:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 04, 2016 at 02:30:23PM +0100, Arnd Bergmann wrote:
> Ah, I thought it only required 32-bit alignment like ldm/stm, but it
> seems that it won't do that. However, an implementation like
> 
> unsigned long long get_unaligned_u64(void *p)
> {
>         unsigned long long upper, lower;
>         lower = *(unsigned long*)p;
>         upper = *(unsigned long*)(p+4);
> 
>         return lower | (upper << 32);
> }
> 
> does get compiled into
> 
> 00000000 <f>:
>    0:   e8900003        ldm     r0, {r0, r1}
>    4:   e12fff1e        bx      lr

I think it may be something of a bitch to work around, because the
compiler is going to do stuff like that behind your back.

The only way around that would be to bypass the compiler by using
asm(), but then you end up bypassing the instruction scheduling too.
That may not matter, as the resulting overhead may still be lower.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* DWord alignment on ARMv7
  2016-03-04 13:46               ` Russell King - ARM Linux
@ 2016-03-04 14:41                 ` Arnd Bergmann
  2016-03-04 14:56                   ` Russell King - ARM Linux
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2016-03-04 14:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 04 March 2016 13:46:51 Russell King - ARM Linux wrote:
> On Fri, Mar 04, 2016 at 02:30:23PM +0100, Arnd Bergmann wrote:
> > Ah, I thought it only required 32-bit alignment like ldm/stm, but it
> > seems that it won't do that. However, an implementation like
> > 
> > unsigned long long get_unaligned_u64(void *p)
> > {
> >         unsigned long long upper, lower;
> >         lower = *(unsigned long*)p;
> >         upper = *(unsigned long*)(p+4);
> > 
> >         return lower | (upper << 32);
> > }
> > 
> > does get compiled into
> > 
> > 00000000 <f>:
> >    0:   e8900003        ldm     r0, {r0, r1}
> >    4:   e12fff1e        bx      lr
> 
> I think it may be something of a bitch to work around, because the
> compiler is going to do stuff like that behind your back.
> 
> The only way around that would be to bypass the compiler by using
> asm(), but then you end up bypassing the instruction scheduling too.
> That may not matter, as the resulting overhead may still be lower.

I think the compiler is correctly optimizing the code according to
what we tell it about the alignment here: the ABI explicitly specifies
that variables are naturally aligned, so whenever we dereference
a 32-bit pointer that is not __packed, it should assume it can safely
use any instruction that requires this alignment. This is essentially
what get_unaligned() is designed to work around, when we have
data that is known to be possibly misaligned.

Simply using the generic implementation from before 0567f5facbd
("asm-generic: allow generic unaligned access if the arch supports it")
solves the problem, as we explicitly tell the compiler about the
fact that data is unaligned and let it do the right thing (bytewise
access on ARMv5 or lower, normal access except ldm and ldlr on
later architectures).

The implementation appears to be suboptimal for cross-endian loads
though, as gcc-5.3 does not use the 'rev' instruction here but
falls back on byte accesses. We can easily fix that by introducing
one more generic implementation for the cross-endian accesses doing

static __always_inline void put_unaligned_be64(u64 val, void *p)
{
       __put_unaligned_cpu64((u64 __force)cpu_to_be64(val), p);
}
static __always_inline void put_unaligned_be32(u32 val, void *p)
{
       __put_unaligned_cpu32((u32 __force)cpu_to_be32(val), p);
}
static __always_inline void put_unaligned_be16(u16 val, void *p)
{
       __put_unaligned_cpu16((u16 __force)cpu_to_be16(val), p);
}

which is better on ARM than the currently used

static inline void __put_unaligned_le16(u16 val, u8 *p)
{
        *p++ = val;
        *p++ = val >> 8;
}
static inline void __put_unaligned_le32(u32 val, u8 *p)
{
        __put_unaligned_le16(val >> 16, p + 2);
        __put_unaligned_le16(val, p);
}
static inline void __put_unaligned_le64(u64 val, u8 *p)
{
        __put_unaligned_le32(val >> 32, p + 4);
        __put_unaligned_le32(val, p);
}

because it will allow using ldr/str+rev for unaligned
cross-endian accesses, but disallow ldm/stm+rev.

	Arnd


* DWord alignment on ARMv7
  2016-03-04 14:41                 ` Arnd Bergmann
@ 2016-03-04 14:56                   ` Russell King - ARM Linux
  2016-03-04 15:49                     ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-03-04 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 04, 2016 at 03:41:58PM +0100, Arnd Bergmann wrote:
> On Friday 04 March 2016 13:46:51 Russell King - ARM Linux wrote:
> > On Fri, Mar 04, 2016 at 02:30:23PM +0100, Arnd Bergmann wrote:
> > > Ah, I thought it only required 32-bit alignment like ldm/stm, but it
> > > seems that it won't do that. However, an implementation like
> > > 
> > > unsigned long long get_unaligned_u64(void *p)
> > > {
> > >         unsigned long long upper, lower;
> > >         lower = *(unsigned long*)p;
> > >         upper = *(unsigned long*)(p+4);
> > > 
> > >         return lower | (upper << 32);
> > > }
> > > 
> > > does get compiled into
> > > 
> > > 00000000 <f>:
> > >    0:   e8900003        ldm     r0, {r0, r1}
> > >    4:   e12fff1e        bx      lr
> > 
> > I think it may be something of a bitch to work around, because the
> > compiler is going to do stuff like that behind your back.
> > 
> > The only way around that would be to bypass the compiler by using
> > asm(), but then you end up bypassing the instruction scheduling too.
> > That may not matter, as the resulting overhead may still be lower.
> 
> I think the compiler is correctly optimizing the code according to
> what we tell it about the alignment here:

It is, that's not what I was saying though - I wasn't saying whether or
not the compiler is correctly optimising the code (it is.)  I was saying
that we _don't_ want the compiler optimising the code in this way here -
that's a completely _different_ point.

> The implementation appears to be suboptimal for cross-endian loads
> though, as gcc-5.3 does not use the 'rev' instruction here but
> falls back on byte accesses. We can easily fix that by introducing
> one more generic implementation for the cross-endian accesses doing
> 
> static __always_inline void put_unaligned_be64(u64 val, void *p)
> {
>        __put_unaligned_cpu64((u64 __force)cpu_to_be64(val), p);
> }
> static __always_inline void put_unaligned_be32(u32 val, void *p)
> {
>        __put_unaligned_cpu32((u32 __force)cpu_to_be32(val), p);
> }
> static __always_inline void put_unaligned_be16(u16 val, void *p)
> {
>        __put_unaligned_cpu16((u16 __force)cpu_to_be16(val), p);
> }
> 
> which is better on ARM than the currently used
> 
> static inline void __put_unaligned_le16(u16 val, u8 *p)
> {
>         *p++ = val;
>         *p++ = val >> 8;
> }
> static inline void __put_unaligned_le32(u32 val, u8 *p)
> {
>         __put_unaligned_le16(val >> 16, p + 2);
>         __put_unaligned_le16(val, p);
> }
> static inline void __put_unaligned_le64(u64 val, u8 *p)
> {
>         __put_unaligned_le32(val >> 32, p + 4);
>         __put_unaligned_le32(val, p);
> }
> 
> because it will allow using ldr/str+rev for unaligned
> cross-endian accesses, but disallow ldm/stm+rev.

Looks like it's going to make the unaligned stuff even more of a rabbit
warren of include files.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* DWord alignment on ARMv7
  2016-03-04 14:56                   ` Russell King - ARM Linux
@ 2016-03-04 15:49                     ` Arnd Bergmann
  0 siblings, 0 replies; 17+ messages in thread
From: Arnd Bergmann @ 2016-03-04 15:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 04 March 2016 14:56:56 Russell King - ARM Linux wrote:
> On Fri, Mar 04, 2016 at 03:41:58PM +0100, Arnd Bergmann wrote:
> > On Friday 04 March 2016 13:46:51 Russell King - ARM Linux wrote:
> > > On Fri, Mar 04, 2016 at 02:30:23PM +0100, Arnd Bergmann wrote:
> > > > Ah, I thought it only required 32-bit alignment like ldm/stm, but it
> > > > seems that it won't do that. However, an implementation like
> > > > 
> > > > unsigned long long get_unaligned_u64(void *p)
> > > > {
> > > >         unsigned long long upper, lower;
> > > >         lower = *(unsigned long*)p;
> > > >         upper = *(unsigned long*)(p+4);
> > > > 
> > > >         return lower | (upper << 32);
> > > > }
> > > > 
> > > > does get compiled into
> > > > 
> > > > 00000000 <f>:
> > > >    0:   e8900003        ldm     r0, {r0, r1}
> > > >    4:   e12fff1e        bx      lr
> > > 
> > > I think it may be something of a bitch to work around, because the
> > > compiler is going to do stuff like that behind your back.
> > > 
> > > The only way around that would be to bypass the compiler by using
> > > asm(), but then you end up bypassing the instruction scheduling too.
> > > That may not matter, as the resulting overhead may still be lower.
> > 
> > I think the compiler is correctly optimizing the code according to
> > what we tell it about the alignment here:
> 
> It is, that's not what I was saying though - I wasn't saying whether or
> not the compiler is correctly optimising the code (it is.)  I was saying
> that we _don't_ want the compiler optimising the code in this way here -
> that's a completely _different_ point.

Yes, I understood that. What I meant was that the kernel implementation
here tells the compiler that it's safe to do certain optimizations when
it's actually not.

> > 
> > because it will allow using ldr/str+rev for unaligned
> > cross-endian accesses, but disallow ldm/stm+rev.
> 
> Looks like it's going to make the unaligned stuff even more of a rabbit
> warren of include files.

So here is what I came up with. I think this should be good on
most architectures (if we are not sure about that, we could make
it ARM and ARM64 specific), and it's actually noticeably simpler than
the previous version. This again removes the distinction between
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set and not set in the
get_unaligned code (as it was before the patch that introduced
the regression), but optimizes the cross-endian loads/stores enough
so the compiler uses the same access as native-endian, followed
by a byte swap operation on architectures that have it.

	Arnd

commit 5987b14b8d13eafc3bb78b070c4d482706cf3ce6
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Mar 4 16:15:20 2016 +0100

    asm-generic: always use struct based unaligned access
    
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

 include/asm-generic/unaligned.h     | 20 +++++---------------
 include/linux/unaligned/be_struct.h | 13 +++++++------
 include/linux/unaligned/le_struct.h | 13 +++++++------
 3 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/include/asm-generic/unaligned.h b/include/asm-generic/unaligned.h
index 1ac097279db1..e8f5523eeb0a 100644
--- a/include/asm-generic/unaligned.h
+++ b/include/asm-generic/unaligned.h
@@ -3,29 +3,19 @@
 
 /*
  * This is the most generic implementation of unaligned accesses
- * and should work almost anywhere.
+ * and should work almost anywhere, we trust that the compiler
+ * knows how to handle unaligned accesses.
  */
 #include <asm/byteorder.h>
 
-/* Set by the arch if it can handle unaligned accesses in hardware. */
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-# include <linux/unaligned/access_ok.h>
-#endif
+#include <linux/unaligned/le_struct.h>
+#include <linux/unaligned/be_struct.h>
+#include <linux/unaligned/generic.h>
 
 #if defined(__LITTLE_ENDIAN)
-# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-#  include <linux/unaligned/le_struct.h>
-#  include <linux/unaligned/be_byteshift.h>
-# endif
-# include <linux/unaligned/generic.h>
 # define get_unaligned	__get_unaligned_le
 # define put_unaligned	__put_unaligned_le
 #elif defined(__BIG_ENDIAN)
-# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-#  include <linux/unaligned/be_struct.h>
-#  include <linux/unaligned/le_byteshift.h>
-# endif
-# include <linux/unaligned/generic.h>
 # define get_unaligned	__get_unaligned_be
 # define put_unaligned	__put_unaligned_be
 #else
diff --git a/include/linux/unaligned/be_struct.h b/include/linux/unaligned/be_struct.h
index 132415836c50..9ab8c53bb3fe 100644
--- a/include/linux/unaligned/be_struct.h
+++ b/include/linux/unaligned/be_struct.h
@@ -2,35 +2,36 @@
 #define _LINUX_UNALIGNED_BE_STRUCT_H
 
 #include <linux/unaligned/packed_struct.h>
+#include <asm/byteorder.h>
 
 static inline u16 get_unaligned_be16(const void *p)
 {
-	return __get_unaligned_cpu16((const u8 *)p);
+	return be16_to_cpu((__be16 __force)__get_unaligned_cpu16((const u8 *)p));
 }
 
 static inline u32 get_unaligned_be32(const void *p)
 {
-	return __get_unaligned_cpu32((const u8 *)p);
+	return be32_to_cpu((__be32 __force)__get_unaligned_cpu32((const u8 *)p));
 }
 
 static inline u64 get_unaligned_be64(const void *p)
 {
-	return __get_unaligned_cpu64((const u8 *)p);
+	return be64_to_cpu((__be64 __force)__get_unaligned_cpu64((const u8 *)p));
 }
 
 static inline void put_unaligned_be16(u16 val, void *p)
 {
-	__put_unaligned_cpu16(val, p);
+	__put_unaligned_cpu16((u16 __force)cpu_to_be16(val), p);
 }
 
 static inline void put_unaligned_be32(u32 val, void *p)
 {
-	__put_unaligned_cpu32(val, p);
+	__put_unaligned_cpu32((u32 __force)cpu_to_be32(val), p);
 }
 
 static inline void put_unaligned_be64(u64 val, void *p)
 {
-	__put_unaligned_cpu64(val, p);
+	__put_unaligned_cpu64((u64 __force)cpu_to_be64(val), p);
 }
 
 #endif /* _LINUX_UNALIGNED_BE_STRUCT_H */
diff --git a/include/linux/unaligned/le_struct.h b/include/linux/unaligned/le_struct.h
index 088c4572faa8..64171ad0b100 100644
--- a/include/linux/unaligned/le_struct.h
+++ b/include/linux/unaligned/le_struct.h
@@ -2,35 +2,36 @@
 #define _LINUX_UNALIGNED_LE_STRUCT_H
 
 #include <linux/unaligned/packed_struct.h>
+#include <asm/byteorder.h>
 
 static inline u16 get_unaligned_le16(const void *p)
 {
-	return __get_unaligned_cpu16((const u8 *)p);
+	return le16_to_cpu((__le16 __force)__get_unaligned_cpu16((const u8 *)p));
 }
 
 static inline u32 get_unaligned_le32(const void *p)
 {
-	return __get_unaligned_cpu32((const u8 *)p);
+	return le32_to_cpu((__le32 __force)__get_unaligned_cpu32((const u8 *)p));
 }
 
 static inline u64 get_unaligned_le64(const void *p)
 {
-	return __get_unaligned_cpu64((const u8 *)p);
+	return le64_to_cpu((__le64 __force)__get_unaligned_cpu64((const u8 *)p));
 }
 
 static inline void put_unaligned_le16(u16 val, void *p)
 {
-	__put_unaligned_cpu16(val, p);
+	__put_unaligned_cpu16((u16 __force)cpu_to_le16(val), p);
 }
 
 static inline void put_unaligned_le32(u32 val, void *p)
 {
-	__put_unaligned_cpu32(val, p);
+	__put_unaligned_cpu32((u32 __force)cpu_to_le32(val), p);
 }
 
 static inline void put_unaligned_le64(u64 val, void *p)
 {
-	__put_unaligned_cpu64(val, p);
+	__put_unaligned_cpu64((u64 __force)cpu_to_le64(val), p);
 }
 
 #endif /* _LINUX_UNALIGNED_LE_STRUCT_H */

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-03-04 15:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-03 22:27 DWord alignment on ARMv7 Marc Kleine-Budde
2016-03-03 23:54 ` Will Deacon
2016-03-04  8:01   ` btrfs_get_token_64() alignment problem on ARM (was: Re: DWord alignment on ARMv7) Marc Kleine-Budde
2016-03-04  9:16     ` David Sterba
2016-03-04 10:48   ` DWord alignment on ARMv7 Ard Biesheuvel
2016-03-04 11:02     ` Russell King - ARM Linux
2016-03-04 11:14       ` Ard Biesheuvel
2016-03-04 11:19         ` Marc Kleine-Budde
2016-03-04 11:26           ` Ard Biesheuvel
2016-03-04 11:38         ` Arnd Bergmann
2016-03-04 11:44           ` Ard Biesheuvel
2016-03-04 13:30             ` Arnd Bergmann
2016-03-04 13:33               ` Ard Biesheuvel
2016-03-04 13:46               ` Russell King - ARM Linux
2016-03-04 14:41                 ` Arnd Bergmann
2016-03-04 14:56                   ` Russell King - ARM Linux
2016-03-04 15:49                     ` Arnd Bergmann
