From: h.feurstein@gmail.com (Hubert Feurstein)
To: linux-arm-kernel@lists.infradead.org
Subject: ARM: big performance waste in memcpy_{from,to}io
Date: Thu, 12 Nov 2009 17:49:49 +0100
Message-ID: <200911121749.49676.h.feurstein@gmail.com>

Hi Russell,

I'm working with a Contec Micro9 board (ep93xx-based, with two Spansion NOR
flash chips in parallel => 32-bit memory bus width) and was wondering why the
read performance of the flash (through /dev/mtd*) is so poor. So I connected
a logic analyser to the data and address buses and noticed that each flash
word address is accessed four times. This means that the flash is read
byte by byte, which is IMO a big waste of performance, since it would be
possible to read the full word (four bytes) at once. So I dug around in the
mtd driver and found the function "memcpy_fromio", which is called to read
the flash data. I was really surprised when I looked at the implementation,
which is:

arch/arm/kernel/io.c:

/*
 * Copy data from IO memory space to "real" memory space.
 * This needs to be optimized.
 */
void _memcpy_fromio(void *to, const volatile void __iomem *from, size_t count)
{
	unsigned char *t = to;
	while (count) {
		count--;
		*t = readb(from);
		t++;
		from++;
	}
}
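
The access pattern seen on the logic analyser can be reproduced with a small
user-space model. This is only a sketch: struct sim_bus, sim_readb(),
sim_readl(), copy_bytewise() and copy_wordwise() are hypothetical names made
up for illustration, not kernel APIs, and the model assumes that every
readb()-style access costs one full bus cycle on the 32-bit bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 32-bit-wide flash bus that counts cycles. */
struct sim_bus {
	const uint32_t *words;	/* backing storage, one entry per flash word */
	unsigned long cycles;	/* number of bus accesses issued so far */
};

/* One bus cycle that returns a single byte (models readb()). */
static uint8_t sim_readb(struct sim_bus *bus, size_t off)
{
	uint8_t b;

	bus->cycles++;
	memcpy(&b, (const uint8_t *)bus->words + off, 1);
	return b;
}

/* One bus cycle that returns a whole 32-bit word (models a word read). */
static uint32_t sim_readl(struct sim_bus *bus, size_t idx)
{
	bus->cycles++;
	return bus->words[idx];
}

/* Same loop structure as _memcpy_fromio(): one access per byte. */
static void copy_bytewise(struct sim_bus *bus, void *to, size_t count)
{
	uint8_t *t = to;

	for (size_t i = 0; i < count; i++)
		t[i] = sim_readb(bus, i);
}

/* Word-at-a-time variant; this sketch only handles multiples of 4 bytes. */
static void copy_wordwise(struct sim_bus *bus, void *to, size_t count)
{
	uint8_t *t = to;

	assert(count % 4 == 0);
	for (size_t i = 0; i < count / 4; i++) {
		uint32_t w = sim_readl(bus, i);

		memcpy(t + 4 * i, &w, sizeof(w));
	}
}
```

Copying 16 bytes through this model costs 16 bus cycles with copy_bytewise()
and 4 with copy_wordwise(), and both produce the same destination buffer. A
real implementation would also have to handle unaligned heads and tails,
which this sketch deliberately leaves out.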

OK, with this byte-wise implementation the poor flash read performance is
fully explained. So I tried to fix it. I found the real "memcpy"
implementation, which is written in assembler and seems to be quite optimized,
and changed the code to this:

Index: linux-2.6.31/arch/arm/include/asm/io.h
===================================================================
--- linux-2.6.31.orig/arch/arm/include/asm/io.h
+++ linux-2.6.31/arch/arm/include/asm/io.h
@@ -195,9 +195,9 @@ extern void _memset_io(volatile void __i
 #define writesw(p,d,l)		__raw_writesw(__mem_pci(p),d,l)
 #define writesl(p,d,l)		__raw_writesl(__mem_pci(p),d,l)

-#define memset_io(c,v,l)	_memset_io(__mem_pci(c),(v),(l))
-#define memcpy_fromio(a,c,l)	_memcpy_fromio((a),__mem_pci(c),(l))
-#define memcpy_toio(c,a,l)	_memcpy_toio(__mem_pci(c),(a),(l))
+#define memset_io(c,v,l)	memset(__mem_pci(c),(v),(l))
+#define memcpy_fromio(a,c,l)	memcpy((a),__mem_pci(c),(l))
+#define memcpy_toio(c,a,l)	memcpy(__mem_pci(c),(a),(l))

 #elif !defined(readb)

Since on the ARM architecture there is no difference between I/O memory space
and 'real' memory space, this should work. The following tests show the impact
of this change:

[root@micro9]# cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00040000 00020000 "RedBoot"
mtd1: 01fa0000 00020000 "test"
mtd2: 0001f000 00020000 "FIS directory"
mtd3: 00001000 00020000 "RedBoot config"

This is the read-time with the original ARM implementation:

[root@micro9]# time cat /dev/mtd1 > /dev/null
real    0m 7.27s
user    0m 0.00s
sys     0m 7.26s

and here is the read-time with my simple change:

[root@micro9]# time cat /dev/mtd1 > /dev/null
real    0m 0.96s
user    0m 0.00s
sys     0m 0.95s

Wow, that is about 7.6 times faster!
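
For reference, the two timings can be turned into throughput figures using
the size of mtd1 from /proc/mtd (0x01fa0000 bytes). The following is a
minimal sketch of that arithmetic; throughput_mib_s() and speedup() are
hypothetical helper names, and the "real" times from `time cat` are assumed
to be dominated by the flash reads.

```c
#include <assert.h>

/* Size of the mtd1 partition as listed in /proc/mtd: 0x01fa0000 bytes. */
#define MTD1_SIZE_BYTES 0x01fa0000UL

/* Throughput in MiB/s when `bytes` are read in `seconds`. */
static double throughput_mib_s(unsigned long bytes, double seconds)
{
	assert(seconds > 0.0);
	return (double)bytes / (1024.0 * 1024.0) / seconds;
}

/* Ratio of the old run time to the new run time. */
static double speedup(double t_before, double t_after)
{
	assert(t_after > 0.0);
	return t_before / t_after;
}
```

With the measurements above, throughput_mib_s(MTD1_SIZE_BYTES, 7.27) gives
roughly 4.35 MiB/s, throughput_mib_s(MTD1_SIZE_BYTES, 0.96) gives roughly
32.9 MiB/s, and speedup(7.27, 0.96) is about 7.57.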

Because of the word accesses to the bus, I can take advantage of the
burst-mode option of the SMC (static memory controller) of the ep93xx, which
increased the performance by 35% (the 0.96s was already measured with burst
mode enabled). With the byte accesses of the original implementation, burst
mode seems to have no influence at all.

I've seen that such "simple and slow" memcpy_{to,from}io implementations exist
in many other architectures. So there may be considerable potential to improve
overall I/O performance, since a lot of drivers use these memcpy_{to,from}io
functions.

For testing I used kernel version 2.6.31.

Are there any drawbacks to using the good-and-fast "memcpy"? On my Micro9
board everything is running fine so far.

Best Regards,
Hubert

---
Hubert Feurstein
Software-Engineer

Contec Steuerungstechnik & Automation GmbH
Wildbichler Straße 2e
6341 Ebbs
Austria
www.contec.at
