ARM: big performance waste in memcpy

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* ARM: big performance waste in memcpy_{from,to}io
@ 2009-11-12 16:49 Hubert Feurstein
  2009-11-12 18:44 ` Alexander Clouter
  2009-11-13 23:14 ` Ben Dooks
  0 siblings, 2 replies; 9+ messages in thread
From: Hubert Feurstein @ 2009-11-12 16:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russel,

I'm working with an Contec Micro9 board (ep93xx-based with two Spansion-NOR-
Flash chips in parallel => 32bit memory-buswidth) and was wondering why the
read-performance of the flash (through /dev/mtd*) is so quite poor. So I 
connected a logic analyser to the data- and address-bus and recognized that 
the accesses to the same flash-word-address happens four times. This means 
that the flash is read byte-by-byte, which is IMO a big waste of performance 
since it would be possible to read the full word (four bytes) at once. So I 
digged around in the mtd-driver and found the function "memcpy_fromio" which 
is called to read the flash data. I was really surprised when looked to the 
implementation, which is:

arch/arm/kernel/io.c:

/*
 * Copy data from IO memory space to "real" memory space.
 * This needs to be optimized.
 */
void _memcpy_fromio(void *to, const volatile void __iomem *from, size_t count)
{
	unsigned char *t = to;
	while (count) {
		count--;
		*t = readb(from);
		t++;
		from++;
	}
}

Ok, with this poor memcpy-implementation the poor flash-read-performance is
fully explainable. So I tried to fix this. I found the real "memcpy"
implementation which is written in assemler and seems to be quite optimized.
So I changed the the code to this:

Index: linux-2.6.31/arch/arm/include/asm/io.h
===================================================================
--- linux-2.6.31.orig/arch/arm/include/asm/io.h
+++ linux-2.6.31/arch/arm/include/asm/io.h
@@ -195,9 +195,9 @@ extern void _memset_io(volatile void __i
 #define writesw(p,d,l)		__raw_writesw(__mem_pci(p),d,l)
 #define writesl(p,d,l)		__raw_writesl(__mem_pci(p),d,l)

-#define memset_io(c,v,l)	_memset_io(__mem_pci(c),(v),(l))
-#define memcpy_fromio(a,c,l)	_memcpy_fromio((a),__mem_pci(c),(l))
-#define memcpy_toio(c,a,l)	_memcpy_toio(__mem_pci(c),(a),(l))
+#define memset_io(c,v,l)	memset(__mem_pci(c),(v),(l))
+#define memcpy_fromio(a,c,l)	memcpy((a),__mem_pci(c),(l))
+#define memcpy_toio(c,a,l)	memcpy(__mem_pci(c),(a),(l))

 #elif !defined(readb)

Because on the ARM architecture there is no difference between io-memspace
and the 'real' memspace so it should work. The following tests show the impact
of this change:

[root at micro9]\# cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00040000 00020000 "RedBoot"
mtd1: 01fa0000 00020000 "test"
mtd2: 0001f000 00020000 "FIS directory"
mtd3: 00001000 00020000 "RedBoot config"

This is the read-time with the original ARM implementation:

[root at micro9]\# time cat /dev/mtd1 > /dev/null
real    0m 7.27s
user    0m 0.00s
sys     0m 7.26s

and here is the read-time with my simple change:

[root at micro9]\# time cat /dev/mtd1 > /dev/null
real    0m 0.96s
user    0m 0.00s
sys     0m 0.95s

Wow, that is more than 7.6-times faster!

Because of the word-accesses to the bus, I can take advantage of the burst-
mode option of the SMC (static memory controller) of the ep93xx which
increased the performance by 35% (0.96s was already measured with burst-mode
enabled). With the byte-accesses of the original implementation the burst-mode
seem to have no influence at all.

I've seen that such "simple and slow" memcpy_{to,from)io implementations exist
in many other architectures. So maybe this is a big potential to improve 
overall io-performance, since a lot of drivers use these memcpy_{to,from)io 
functions.

For testing I used kernel version 2.6.31.

Are there any drawbacks when using the good-and-fast "memcpy" ? On my Micro9-
board everything is running fine so far.

Best Regards,
Hubert

---
Hubert Feurstein
Software-Engineer

Contec Steuerungstechnik & Automation GmbH
Wildbichler Stra?e 2e
6341 Ebbs
Austria
www.contec.at

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-12 16:49 ARM: big performance waste in memcpy_{from,to}io Hubert Feurstein
@ 2009-11-12 18:44 ` Alexander Clouter
  2009-11-13 11:32   ` Hubert Feurstein
  2009-11-13 23:14 ` Ben Dooks
  1 sibling, 1 reply; 9+ messages in thread
From: Alexander Clouter @ 2009-11-12 18:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hubert Feurstein <h.feurstein@gmail.com> wrote:
> 
> I'm working with an Contec Micro9 board (ep93xx-based with two Spansion-NOR-
> Flash chips in parallel => 32bit memory-buswidth) and was wondering why the
> read-performance of the flash (through /dev/mtd*) is so quite poor. So I 
> connected a logic analyser to the data- and address-bus and recognized that 
> the accesses to the same flash-word-address happens four times. This means 
> that the flash is read byte-by-byte, which is IMO a big waste of performance 
> since it would be possible to read the full word (four bytes) at once. So I 
> digged around in the mtd-driver and found the function "memcpy_fromio" which 
> is called to read the flash data. I was really surprised when looked to the 
> implementation, which is:
> 
> [snipped]
> 
> Are there any drawbacks when using the good-and-fast "memcpy" ? On my Micro9-
> board everything is running fine so far.
> 

>From bogus@does.not.exist.com  Fri Nov  6 13:01:15 2009
From: bogus@does.not.exist.com ()
Date: Fri, 06 Nov 2009 18:01:15 -0000
Subject: No subject
Message-ID: <mailman.17.1258053123.2170.linux-arm-kernel@lists.infradead.org>

NOR's too) do *not* support 32bit wide read's.  For example, if I 
remember correctly, the NAND driver for orion will let you read 32bits 
but write only 8bits at a time.  Other platforms are only 8bit wide in 
both direction.  I guess the 'slow' memcpy version is used as 
*everything* supports 8bit reads....I *guess* :)

For the NAND world, we have plat_nand and you can populate 
(read|write)_buf with your optimised version; of course you are using 
NOR so I have no idea if you have to treat that differently....

Cheers

-- 
Alexander Clouter
.sigmonster says: Make sure your code does nothing gracefully.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-12 18:44 ` Alexander Clouter
@ 2009-11-13 11:32   ` Hubert Feurstein
  2009-11-13 12:24     ` Russell King - ARM Linux
  0 siblings, 1 reply; 9+ messages in thread
From: Hubert Feurstein @ 2009-11-13 11:32 UTC (permalink / raw)
  To: linux-arm-kernel

Am Donnerstag, 12. November 2009 19:44:40 schrieb Alexander Clouter:
> > Are there any drawbacks when using the good-and-fast "memcpy" ? On my
> > Micro9- board everything is running fine so far.
> 
> From the small bit of MTD work I have done, some NAND's (and I guess
> NOR's too) do *not* support 32bit wide read's.  For example, if I
> remember correctly, the NAND driver for orion will let you read 32bits
> but write only 8bits at a time.  Other platforms are only 8bit wide in
> both direction.  I guess the 'slow' memcpy version is used as
> *everything* supports 8bit reads....I *guess* :)
> 

The Spansion NOR Flashes (like S29GL...) have a 16bit wide data bus, by using 
two in parallel it looks like a 32bit-flash for the system. So 32bit reads 
from the flash works just fine. But in fact that is not the point at all. And 
it is not an issue of the mtd-driver, that's why I have posted this only to 
the arm-linux-mailinglist and not to the mtd-list.

The memcpy_{to,from}io-function don't has to care about the bus-width of the 
attached peripheral, because this is already handled correctly by the static 
memory controller of your arm-derivate (Of course this one has to be 
configured correctly to the peripherals bus width). In the rare case where you 
have to take care about that it is anyway a bad idea to use a memcpy_xxio-
function.

I've checked the implementations of some other architectures. And a lot of 
them already have optimized memcpy_{from,to}io functions:

	- alpha/io.c
	- parisc/lib/io.c
	- powerpc/kernel/io.c
	- avr32/include/asm/io.h
	- blackfin/include/asm/io.h
	- cris/include/asm/io.h
	- frv/include/asm/io.h
	- h8300/include/asm/io.h
	- m32r/include/asm/io.h
	- m68k, microblaze, mips, mn10300, ...

One architecture which also uses the simple and slow version is 'sh' (and 
maybe there are a few others).

Just want to ask the community if there is a really good reason why this 
bottle neck is still in the ARM kernel?

@Russell: What's your opinion on that?

best regards
Hubert

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-13 11:32   ` Hubert Feurstein
@ 2009-11-13 12:24     ` Russell King - ARM Linux
  2009-11-13 12:42       ` Andy Green
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Russell King - ARM Linux @ 2009-11-13 12:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Nov 13, 2009 at 12:32:41PM +0100, Hubert Feurstein wrote:
> The memcpy_{to,from}io-function don't has to care about the bus-width of the 
> attached peripheral, because this is already handled correctly by the static 
> memory controller of your arm-derivate (Of course this one has to be 
> configured correctly to the peripherals bus width). In the rare case where you 
> have to take care about that it is anyway a bad idea to use a memcpy_xxio-
> function.

I believe there are SoCs where using 32-bit reads on lesser-width buses
generate faults.  There are certainly SoCs which do abort reads/writes
to certain peripherals which aren't the right size.

> Just want to ask the community if there is a really good reason why this 
> bottle neck is still in the ARM kernel?

No one has been bothered for the last 15 years to optimise it.

> @Russell: What's your opinion on that?

The problem is whether optimising it will end up breaking existing users.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-13 12:24     ` Russell King - ARM Linux
@ 2009-11-13 12:42       ` Andy Green
  2009-11-13 14:00       ` Bill Gatliff
  2009-11-13 15:16       ` ARM: big performance waste in memcpy_{from,to}io Hubert Feurstein
  2 siblings, 0 replies; 9+ messages in thread
From: Andy Green @ 2009-11-13 12:42 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/13/09 12:24, Somebody in the thread at some point said:
> On Fri, Nov 13, 2009 at 12:32:41PM +0100, Hubert Feurstein wrote:
>> The memcpy_{to,from}io-function don't has to care about the bus-width of the
>> attached peripheral, because this is already handled correctly by the static
>> memory controller of your arm-derivate (Of course this one has to be
>> configured correctly to the peripherals bus width). In the rare case where you
>> have to take care about that it is anyway a bad idea to use a memcpy_xxio-
>> function.
>
> I believe there are SoCs where using 32-bit reads on lesser-width buses
> generate faults.  There are certainly SoCs which do abort reads/writes
> to certain peripherals which aren't the right size.

Not much help to the discussion about memcpy, but I saw an example of 
this death by bus size mismatch is iMX31's internal watchdog.  I failed 
to notice the registers were (unusually) specified to be 16-bit.

Using a u32 * to access the first register, which was 32-bit aligned OK, 
gave an "imprecise external abort" until I read the datasheet closer and 
used a u16 *.

-Andy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-13 12:24     ` Russell King - ARM Linux
  2009-11-13 12:42       ` Andy Green
@ 2009-11-13 14:00       ` Bill Gatliff
  2009-11-16 14:57         ` [RFC PATCH] ARM: add (experimental) alternative memcpy_{from, to}io() and memset_io() Hubert Feurstein
  2009-11-13 15:16       ` ARM: big performance waste in memcpy_{from,to}io Hubert Feurstein
  2 siblings, 1 reply; 9+ messages in thread
From: Bill Gatliff @ 2009-11-13 14:00 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> The problem is whether optimising it will end up breaking existing users.
>   

I say we throw in the optimized version and enable it by default, along
with a Kconfig to defeat it upon user request at build time (or maybe a
module parameter).  And then see who complains.


b.g.

-- 
Bill Gatliff
bgat at billgatliff.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-13 12:24     ` Russell King - ARM Linux
  2009-11-13 12:42       ` Andy Green
  2009-11-13 14:00       ` Bill Gatliff
@ 2009-11-13 15:16       ` Hubert Feurstein
  2 siblings, 0 replies; 9+ messages in thread
From: Hubert Feurstein @ 2009-11-13 15:16 UTC (permalink / raw)
  To: linux-arm-kernel

2009/11/13 Russell King - ARM Linux <linux@arm.linux.org.uk>:
> On Fri, Nov 13, 2009 at 12:32:41PM +0100, Hubert Feurstein wrote:
>> The memcpy_{to,from}io-function don't has to care about the bus-width of the
>> attached peripheral, because this is already handled correctly by the static
>> memory controller of your arm-derivate (Of course this one has to be
>> configured correctly to the peripherals bus width). In the rare case where you
>> have to take care about that it is anyway a bad idea to use a memcpy_xxio-
>> function.
>
> I believe there are SoCs where using 32-bit reads on lesser-width buses
> generate faults. ?There are certainly SoCs which do abort reads/writes
> to certain peripherals which aren't the right size.
>
>> Just want to ask the community if there is a really good reason why this
>> bottle neck is still in the ARM kernel?
>
> No one has been bothered for the last 15 years to optimise it.
>

Well, I do, because performance matters.

>> @Russell: What's your opinion on that?
>
> The problem is whether optimising it will end up breaking existing users.
>

Of course you are right, such a change will definitly needs good testing. But
the fear of breaking things should never stop you from improving. Even if it
turns out that the normal good optimized memcpy is not 100% suitable in all
cases there are many other ways of improving memcpy_xxio. IMO a memcpy_xxio
function should be something which uses the power of the architecture,
because it is
used so often. And as my tests showed there is big potential in it.

Regards
Hubert

^ permalink raw reply	[flat|nested] 9+ messages in thread

* ARM: big performance waste in memcpy_{from,to}io
  2009-11-12 16:49 ARM: big performance waste in memcpy_{from,to}io Hubert Feurstein
  2009-11-12 18:44 ` Alexander Clouter
@ 2009-11-13 23:14 ` Ben Dooks
  1 sibling, 0 replies; 9+ messages in thread
From: Ben Dooks @ 2009-11-13 23:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Nov 12, 2009 at 05:49:49PM +0100, Hubert Feurstein wrote:
> Hi Russel,
> 
> I'm working with an Contec Micro9 board (ep93xx-based with two Spansion-NOR-
> Flash chips in parallel => 32bit memory-buswidth) and was wondering why the
> read-performance of the flash (through /dev/mtd*) is so quite poor. So I 
> connected a logic analyser to the data- and address-bus and recognized that 
> the accesses to the same flash-word-address happens four times. This means 
> that the flash is read byte-by-byte, which is IMO a big waste of performance 
> since it would be possible to read the full word (four bytes) at once. So I 
> digged around in the mtd-driver and found the function "memcpy_fromio" which 
> is called to read the flash data. I was really surprised when looked to the 
> implementation, which is:

Why use this, there's better {read,write}{b,w,l}s calls available which
do the job in lovely optimised assembler.

-- 
Ben

Q:      What's a light-year?
A:      One-third less calories than a regular year.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH] ARM: add (experimental) alternative memcpy_{from, to}io() and memset_io()
  2009-11-13 14:00       ` Bill Gatliff
@ 2009-11-16 14:57         ` Hubert Feurstein
  0 siblings, 0 replies; 9+ messages in thread
From: Hubert Feurstein @ 2009-11-16 14:57 UTC (permalink / raw)
  To: linux-arm-kernel

Implement faster memcpy_{to,from}io() and memset_io() by using the
highly optimized mem{cpy,set} implementation instead of sequential
8bit read/write accesses. This gives significantly higher throughput
accessing big iomem-regions.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---

Hi Russell,

would such a patch be acceptable, since it would not break existing users?

Would be great if some people from the community could test it and post their
experience.

Regards
Hubert

 arch/arm/Kconfig          |   12 ++++++++++++
 arch/arm/include/asm/io.h |    6 ++++++
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 1c4119c..2723948 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1221,6 +1221,18 @@ config UACCESS_WITH_MEMCPY
 	  However, if the CPU data cache is using a write-allocate mode,
 	  this option is unlikely to provide any performance gain.
 
+config OPTIMIZED_IOMEM_MEMCPY
+	bool "Use kernel memcpy() for memcpy_{from,to}io() (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  Implement faster memcpy_{to,from}io() and memset_io() by using the
+	  highly optimized mem{cpy,set} implementation instead of sequential
+	  8bit read/write accesses. This gives significantly higher throughput
+	  accessing big iomem-regions.
+
+	  This might not work on all machines in case they have problems accessing
+	  the iomem-space word-by-word.
+
 endmenu
 
 menu "Boot options"
diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index d2a59cf..b37e057 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -195,9 +195,15 @@ extern void _memset_io(volatile void __iomem *, int, size_t);
 #define writesw(p,d,l)		__raw_writesw(__mem_pci(p),d,l)
 #define writesl(p,d,l)		__raw_writesl(__mem_pci(p),d,l)
 
+#ifdef CONFIG_OPTIMIZED_IOMEM_MEMCPY
+#define memset_io(c,v,l)	memset(__mem_pci(c),(v),(l))
+#define memcpy_fromio(a,c,l)	memcpy((a),__mem_pci(c),(l))
+#define memcpy_toio(c,a,l)	memcpy(__mem_pci(c),(a),(l))
+#else
 #define memset_io(c,v,l)	_memset_io(__mem_pci(c),(v),(l))
 #define memcpy_fromio(a,c,l)	_memcpy_fromio((a),__mem_pci(c),(l))
 #define memcpy_toio(c,a,l)	_memcpy_toio(__mem_pci(c),(a),(l))
+#endif
 
 #elif !defined(readb)
 
-- 
1.6.4.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2009-11-16 14:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-12 16:49 ARM: big performance waste in memcpy_{from,to}io Hubert Feurstein
2009-11-12 18:44 ` Alexander Clouter
2009-11-13 11:32   ` Hubert Feurstein
2009-11-13 12:24     ` Russell King - ARM Linux
2009-11-13 12:42       ` Andy Green
2009-11-13 14:00       ` Bill Gatliff
2009-11-16 14:57         ` [RFC PATCH] ARM: add (experimental) alternative memcpy_{from, to}io() and memset_io() Hubert Feurstein
2009-11-13 15:16       ` ARM: big performance waste in memcpy_{from,to}io Hubert Feurstein
2009-11-13 23:14 ` Ben Dooks

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).