ARM: big performance waste in memcpy_{from,to}io

From: h.feurstein@gmail.com (Hubert Feurstein)
To: linux-arm-kernel@lists.infradead.org
Subject: ARM: big performance waste in memcpy_{from,to}io
Date: Thu, 12 Nov 2009 17:49:49 +0100	[thread overview]
Message-ID: <200911121749.49676.h.feurstein@gmail.com> (raw)

Hi Russell,

I'm working with a Contec Micro9 board (ep93xx-based, with two Spansion NOR
flash chips in parallel => 32-bit memory bus width) and was wondering why the
read performance of the flash (through /dev/mtd*) is so poor. So I connected
a logic analyser to the data and address buses and noticed that each flash
word address is accessed four times. This means that the flash is read
byte by byte, which is IMO a big waste of performance, since it would be
possible to read the full word (four bytes) at once. So I dug around in the
mtd driver and found the function "memcpy_fromio", which is called to read
the flash data. I was really surprised when I looked at the implementation,
which is:

arch/arm/kernel/io.c:

/*
 * Copy data from IO memory space to "real" memory space.
 * This needs to be optimized.
 */
void _memcpy_fromio(void *to, const volatile void __iomem *from, size_t count)
{
	unsigned char *t = to;
	while (count) {
		count--;
		*t = readb(from);
		t++;
		from++;
	}
}
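
To make the access pattern concrete, here is a minimal userspace sketch (not
kernel code) that counts simulated bus reads for a byte-wise and a word-wise
copy. The names bus_reads, sim_readb, sim_readl, copy_bytewise and
copy_wordwise are hypothetical and exist only for this illustration. The
word-wise variant assumes a 4-byte-aligned source and ignores everything a
real I/O accessor has to care about (barriers, endianness, device quirks):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the logic analyser: one tick per simulated bus read. */
static unsigned long bus_reads;

static uint8_t sim_readb(const volatile uint8_t *p)
{
	bus_reads++;
	return *p;
}

static uint32_t sim_readl(const volatile uint32_t *p)
{
	bus_reads++;
	return *p;
}

/* Byte-wise copy, same shape as _memcpy_fromio above: one read per byte. */
static void copy_bytewise(void *to, const volatile void *from, size_t count)
{
	uint8_t *t = to;
	const volatile uint8_t *f = from;

	while (count--)
		*t++ = sim_readb(f++);
}

/*
 * Word-wise copy: one read per 32-bit word, remaining tail bytes one by one.
 * Assumption: "from" is 4-byte aligned.
 */
static void copy_wordwise(void *to, const volatile void *from, size_t count)
{
	uint8_t *t = to;
	const volatile uint32_t *f = from;
	const volatile uint8_t *tail;

	while (count >= 4) {
		uint32_t w = sim_readl(f++);

		memcpy(t, &w, sizeof(w));
		t += 4;
		count -= 4;
	}

	tail = (const volatile uint8_t *)f;
	while (count--)
		*t++ = sim_readb(tail++);
}
```

Copying 16 bytes from an aligned buffer costs 16 simulated reads with
copy_bytewise() and 4 with copy_wordwise(), which matches the four accesses
per flash word seen on the analyser.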

OK, with this memcpy implementation the poor flash read performance is
fully explained. So I tried to fix this. I found the real "memcpy"
implementation, which is written in assembler and seems to be quite optimized.
So I changed the code to this:

Index: linux-2.6.31/arch/arm/include/asm/io.h
===================================================================
--- linux-2.6.31.orig/arch/arm/include/asm/io.h
+++ linux-2.6.31/arch/arm/include/asm/io.h
@@ -195,9 +195,9 @@ extern void _memset_io(volatile void __i
 #define writesw(p,d,l)		__raw_writesw(__mem_pci(p),d,l)
 #define writesl(p,d,l)		__raw_writesl(__mem_pci(p),d,l)

-#define memset_io(c,v,l)	_memset_io(__mem_pci(c),(v),(l))
-#define memcpy_fromio(a,c,l)	_memcpy_fromio((a),__mem_pci(c),(l))
-#define memcpy_toio(c,a,l)	_memcpy_toio(__mem_pci(c),(a),(l))
+#define memset_io(c,v,l)	memset(__mem_pci(c),(v),(l))
+#define memcpy_fromio(a,c,l)	memcpy((a),__mem_pci(c),(l))
+#define memcpy_toio(c,a,l)	memcpy(__mem_pci(c),(a),(l))

 #elif !defined(readb)

Since on the ARM architecture there is no difference between I/O memory space
and 'real' memory space, this should work. The following tests show the impact
of this change:

[root@micro9]# cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00040000 00020000 "RedBoot"
mtd1: 01fa0000 00020000 "test"
mtd2: 0001f000 00020000 "FIS directory"
mtd3: 00001000 00020000 "RedBoot config"

This is the read-time with the original ARM implementation:

[root@micro9]# time cat /dev/mtd1 > /dev/null
real    0m 7.27s
user    0m 0.00s
sys     0m 7.26s

and here is the read-time with my simple change:

[root@micro9]# time cat /dev/mtd1 > /dev/null
real    0m 0.96s
user    0m 0.00s
sys     0m 0.95s

Wow, that is about 7.6 times faster!
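
As a rough cross-check of these numbers, the following sketch converts the
two "real" times into throughput, using the mtd1 size from /proc/mtd
(0x01fa0000 bytes). The helper names throughput_mib_per_s and speedup are
made up for this example, and the result includes whatever overhead cat and
the mtd char device add:

```c
#include <stdint.h>

/* Size of the "test" partition (mtd1) as listed in /proc/mtd. */
#define MTD1_SIZE_BYTES 0x01fa0000u

/* Throughput in MiB/s when size_bytes are read in the given time. */
static double throughput_mib_per_s(uint32_t size_bytes, double seconds)
{
	return (double)size_bytes / (1024.0 * 1024.0) / seconds;
}

/* Ratio of the old elapsed time to the new one. */
static double speedup(double old_seconds, double new_seconds)
{
	return old_seconds / new_seconds;
}
```

With 7.27s and 0.96s this gives roughly 4.35 MiB/s before and 32.9 MiB/s
after the change, i.e. a speedup of about 7.6.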

Because of the word accesses on the bus, I can take advantage of the burst-mode
option of the SMC (static memory controller) of the ep93xx, which increased
the performance by 35% (the 0.96s was already measured with burst mode
enabled). With the byte accesses of the original implementation, burst mode
seems to have no influence at all.

I've seen that such "simple and slow" memcpy_{to,from}io implementations exist
in many other architectures. So there may be considerable potential to improve
overall I/O performance, since a lot of drivers use these memcpy_{to,from}io
functions.

For testing I used kernel version 2.6.31.

Are there any drawbacks to using the good-and-fast "memcpy"? On my Micro9
board everything is running fine so far.

Best Regards,
Hubert

---
Hubert Feurstein
Software-Engineer

Contec Steuerungstechnik & Automation GmbH
Wildbichler Straße 2e
6341 Ebbs
Austria
www.contec.at
