From mboxrd@z Thu Jan  1 00:00:00 1970
From: h.feurstein@gmail.com (Hubert Feurstein)
Date: Thu, 12 Nov 2009 17:49:49 +0100
Subject: ARM: big performance waste in memcpy_{from,to}io
Message-ID: <200911121749.49676.h.feurstein@gmail.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

Hi Russell,

I'm working with a Contec Micro9 board (ep93xx-based, with two Spansion NOR
flash chips in parallel => 32-bit memory bus width) and was wondering why the
read performance of the flash (through /dev/mtd*) is so poor. So I connected
a logic analyser to the data and address bus and noticed that each flash word
address is accessed four times. This means that the flash is read
byte-by-byte, which is IMO a big waste of performance, since it would be
possible to read the full word (four bytes) at once. So I dug around in the
mtd driver and found the function "memcpy_fromio", which is called to read
the flash data. I was really surprised when I looked at the implementation,
which is:

arch/arm/kernel/io.c:

/*
 * Copy data from IO memory space to "real" memory space.
 * This needs to be optimized.
 */
void _memcpy_fromio(void *to, const volatile void __iomem *from, size_t count)
{
	unsigned char *t = to;
	while (count) {
		count--;
		*t = readb(from);
		t++;
		from++;
	}
}

OK, this byte-wise implementation fully explains the poor flash read
performance. So I tried to fix this. I found the real "memcpy"
implementation, which is written in assembler and seems to be quite optimized.
So I changed the code to this:

Index: linux-2.6.31/arch/arm/include/asm/io.h
===================================================================
--- linux-2.6.31.orig/arch/arm/include/asm/io.h
+++ linux-2.6.31/arch/arm/include/asm/io.h
@@ -195,9 +195,9 @@ extern void _memset_io(volatile void __i
 #define writesw(p,d,l)		__raw_writesw(__mem_pci(p),d,l)
 #define writesl(p,d,l)		__raw_writesl(__mem_pci(p),d,l)

-#define memset_io(c,v,l)	_memset_io(__mem_pci(c),(v),(l))
-#define memcpy_fromio(a,c,l)	_memcpy_fromio((a),__mem_pci(c),(l))
-#define memcpy_toio(c,a,l)	_memcpy_toio(__mem_pci(c),(a),(l))
+#define memset_io(c,v,l)	memset(__mem_pci(c),(v),(l))
+#define memcpy_fromio(a,c,l)	memcpy((a),__mem_pci(c),(l))
+#define memcpy_toio(c,a,l)	memcpy(__mem_pci(c),(a),(l))

 #elif !defined(readb)

On the ARM architecture there is no difference between the I/O memory space
and the 'real' memory space, so this should work. The following tests show
the impact of this change:

[root@micro9]# cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00040000 00020000 "RedBoot"
mtd1: 01fa0000 00020000 "test"
mtd2: 0001f000 00020000 "FIS directory"
mtd3: 00001000 00020000 "RedBoot config"

This is the read-time with the original ARM implementation:

[root@micro9]# time cat /dev/mtd1 > /dev/null
real    0m 7.27s
user    0m 0.00s
sys     0m 7.26s

and here is the read-time with my simple change:

[root@micro9]# time cat /dev/mtd1 > /dev/null
real    0m 0.96s
user    0m 0.00s
sys     0m 0.95s

Wow, that is about 7.6 times faster!

Because of the word accesses to the bus, I can take advantage of the
burst-mode option of the SMC (static memory controller) of the ep93xx, which
increased the performance by 35% (the 0.96s was already measured with burst
mode enabled). With the byte accesses of the original implementation, burst
mode seems to have no influence at all.

I've seen that such "simple and slow" memcpy_{to,from}io implementations exist
in many other architectures. So there may be a lot of potential to improve
overall I/O performance, since a lot of drivers use these memcpy_{to,from}io
functions.

For testing I used kernel version 2.6.31.

Are there any drawbacks to using the good and fast "memcpy"? On my Micro9
board everything has been running fine so far.

Best Regards,
Hubert

---
Hubert Feurstein
Software-Engineer

Contec Steuerungstechnik & Automation GmbH
Wildbichler Straße 2e
6341 Ebbs
Austria
www.contec.at