* flash read performance
0 siblings, 1 reply; 19+ messages in thread
From: Andre Puschmann @ 2008-10-28 10:14 UTC (permalink / raw)
To: linux-mtd
Hi list,
I am currently trying to improve the flash read performance of my
platform. It's a gumstix verdex board with a pxa270 running at 400MHz.
My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's
operating in the _normal_ asynchronous mode. In my opinion the read
performance is very poor, only around 1.2 to 1.4 MB/s depending on the
blocksize. I think it should be possible to get much higher transfer rates.
In Linux, I ran my tests with dd like this (copy 10MB):
time dd if=/dev/mtd5 of=/dev/null bs=16k count=640
640+0 records in
640+0 records out
real 0m 7.17s
user 0m 0.00s
sys 0m 7.17s
Running top in another console shows that the CPU load is very high
during the copy. I am not sure whether the system is doing some sort of
busy waiting. However, it should be possible to do such a copy without
such a high load.
Mem: 17684K used, 45144K free, 0K shrd, 0K buff, 11164K cached
CPU: 0% usr 100% sys 0% nice 0% idle 0% io 0% irq 0% softirq
Load average: 0.10 0.17 0.09
PID PPID USER STAT VSZ %MEM %CPU COMMAND
259 258 root R 1100 2% 95% dd if /dev/mtd5 of /dev/null bs 16k co
I guess a number of people are using a similar/comparable setup. So some
kind of user benchmark would be nice.
I am sure this needs some more investigation, but any comment/hint is
more than welcome.
Best regards,
Andre
* Re: flash read performance
From: Josh Boyer @ 2008-10-29 11:42 UTC (permalink / raw)
To: Andre Puschmann; +Cc: linux-mtd
On Tue, Oct 28, 2008 at 11:14:05AM +0100, Andre Puschmann wrote:
>Hi list,
>
>I am currently trying to improve the flash read performance of my
>platform. It's a gumstix verdex board with a pxa270 running at 400MHz.
>My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's
>operating in the _normal_ asynchronous mode. In my opinion the read
>performance is very poor, only around 1.2 to 1.4 MB/s depending on the
>blocksize. I think it should be possible to get much higher transfer rates.
Why do you think that?
>In Linux, I ran my tests with dd like this (copy 10MB):
>time dd if=/dev/mtd5 of=/dev/null bs=16k count=640
>640+0 records in
>640+0 records out
>real 0m 7.17s
>user 0m 0.00s
>sys 0m 7.17s
>
>
>Running top in another console shows that the CPU load is very high
>during the copy. I am not sure whether the system is doing some sort of
>busy waiting. However, it should be possible to do such a copy without
>such a high load.
Why do you think that? The chip drivers don't do DMA, so all I/O goes
through the CPU.
josh
* Re: flash read performance
From: Jamie Lokier @ 2008-10-29 12:03 UTC (permalink / raw)
To: Josh Boyer; +Cc: Andre Puschmann, linux-mtd
Josh Boyer wrote:
> On Tue, Oct 28, 2008 at 11:14:05AM +0100, Andre Puschmann wrote:
> >Hi list,
> >
> >I am currently trying to improve the flash read performance of my
> >platform. It's a gumstix verdex board with a pxa270 running at 400MHz.
> >My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's
> >operating in the _normal_ asynchronous mode. In my opinion the read
> >performance is very poor, only around 1.2 to 1.4 MB/s depending on the
> >blocksize. I think it should be possible to get much higher transfer rates.
>
> Why do you think that?
Take a look at:
http://marc.info/?l=linux-embedded&m=122125638419881&w=2
http://marc.info/?l=linux-embedded&m=122149685932525&w=2
http://marc.info/?l=linux-embedded&m=122185227511786&w=2
-- Jamie
* Re: flash read performance
From: Andre Puschmann @ 2008-10-29 15:52 UTC (permalink / raw)
To: linux-mtd
Josh Boyer schrieb:
>> In my opinion the read
>> performance is very poor, only around 1.2 to 1.4 MB/s depending on the
>> blocksize. I think it should be possible to get much higher transfer rates.
>
> Why do you think that?
I guess there is something wrong with the timing parameters and/or the
way the CPU core talks to the flash controller, which results in long
wait-states.
But at least to my understanding, these transfer rates have nothing to
do with _high speed NOR flashes_ :-)
> Why do you think that? The chip drivers don't do DMA, so all I/O goes
> through the CPU.
Yes, DMA is not used. However, the CPU should be strong enough to do
this transfer faster. On the other hand, my understanding is that DMA
does not bring speed improvements in all cases: it boosts memory
transfers without adding extra overhead to the CPU, but in this case,
copying data is the only task.
Regards,
Andre
* Re: flash read performance
From: Arnaud Mouiche @ 2008-10-30 8:33 UTC (permalink / raw)
To: linux-mtd
Hi,
I was faced with the same question in the past: bootloader NOR access
was really much faster than the Linux one.
Yes, no DMA was used (but the same in the bootloader, and anyway that
doesn't impact the data rate, only the CPU load), but even worse, the
Linux code was using memcpy_fromio, which is a basic byte-by-byte copy
loop in the default ARM implementation.
Maybe your issues are the same...
Regards,
arnaud
Andre Puschmann a écrit :
> Josh Boyer schrieb:
>
>>> In my opinion the read
>>> performance is very poor, only around 1.2 to 1.4 MB/s depending on the
>>> blocksize. I think it should be possible to get much higher transfer rates.
>>>
>> Why do you think that?
>>
>
> I guess there is something wrong with the timing parameters and/or the
> way the CPU core speaks to the flash controller, which results in long
> wait-states.
> But at least to my understanding, these transfer rates have nothing to
> do with _high speed NOR flashes_ :-)
>
>
>
>> Why do you think that? The chip drivers don't do DMA, so all I/O goes
>> through the CPU.
>>
>
> Yes, DMA is not used. However, the CPU should be strong enough to do
> this transfer faster. On the other hand, my understanding is, DMA brings
> no speed improvements in all cases. It boosts memory transfers without
> adding an extra overhead to the CPU. But in this case, copy data is the
> only task.
>
>
> Regards,
> Andre
>
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
>
>
* Re: flash read performance
From: Andre Puschmann @ 2008-10-30 9:52 UTC (permalink / raw)
To: linux-mtd
Hi,
> I was faced with the same question in the past: bootloader NOR access
> was really much faster than the Linux one.
About how much faster? It really depends on the access method. I am
using U-Boot, and if I use the basic cp.b routine it's about the same
_slow_ speed. I tried the asm-optimised memcpy routine that the kernel
has; this is much faster, around 5 MB/s.
> Yes, no DMA was used (but the same in the bootloader, and anyway that
> doesn't impact the data rate, only the CPU load), but even worse, the
> Linux code was using memcpy_fromio, which is a basic byte-by-byte copy
> loop in the default ARM implementation.
Yes, memcpy_fromio is quite slow. But using normal memcpy is not
suggested; one should only use writel()/readl() and memcpy_[from|to]io().
I am not sure about the right _fast_ way to do such copies.
Regards,
Andre
* Re: flash read performance
From: Arnaud Mouiche @ 2008-10-30 10:06 UTC (permalink / raw)
To: Andre Puschmann; +Cc: linux-mtd
I was using RedBoot, configured to use the optimized memcpy (yes, it
gives you the choice at configuration time).
On the kernel side, I just hacked memcpy_fromio to add a "weak"
attribute and rewrote it to directly use the optimized Linux memcpy
(shame on me for this "not suggested" method, but speed was my goal).
After that, performance is equal between bootloader and Linux, and
really near what is reached by DMA access, which is also the performance
we can calculate from the flash access-time configuration.
arnaud
Andre Puschmann a écrit :
> Hi,
>
>
>> I was faced with the same question in the past: bootloader NOR access
>> was really much faster than the Linux one.
>>
>
> About how much faster? It really depends on the access method. I am
> using U-Boot, and if I use the basic cp.b routine it's about the same
> _slow_ speed. I tried to use the asm-optimised memcpy routine that the
> kernel has. This is much faster, around 5MB/s.
>
>
>> Yes, no DMA was used (but the same on bootloader, and anyway that
>> doesn't impact the data rate, only the CPU load), but even worse, Linux
>> code was using memcpy_fromio, which is a basic byte-by-byte copy loop in the
>> default ARM implementation.
>>
>
> Yes, memcpy_fromio is quite slow. But using normal memcpy is not
> suggested, only use writel()/readl() and memcpy_[from|to]io().
>
> I am not sure about the right _fast_ way to do such copies.
>
>
> Regards,
> Andre
>
>
* Re: flash read performance
From: Andre Puschmann @ 2008-11-03 14:23 UTC (permalink / raw)
To: linux-mtd; +Cc: linux-mtd
Hi,
I spent some more time on this issue and investigated some mtd-maps
drivers.
The kernel I am using is a 2.6.21 that comes out of the gumstix
svn repo. Unfortunately, it uses a legacy driver which only does an
ioremap() but no ioremap_nocache(). Patching the driver with this
additional call boosts transfers up to around 5.5 MB/s, which
is a fair improvement.
I will send a patch to the gumstix list. Users of newer kernels might
not need this, as they use a newer driver (pxa2xx-flash.c) anyway.
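For reference, the shape of that change, modeled on what the later pxa2xx-flash.c driver does (field and function names taken from that driver; the gumstix legacy map driver differs in detail):

```c
/* Uncached mapping: used for command writes, which must hit the bus. */
info->map.virt = ioremap(res->start, len);
/* Cached mapping: the read path uses this for fast sequential reads. */
info->map.cached = ioremap_cached(res->start, len);
/* Hook to invalidate the cached range around writes and erases. */
info->map.inval_cache = pxa2xx_map_inval_cache;
```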
But I am wondering if things can still go faster?!
Jamie, do you have some information about the speed I can expect
theoretically? Or do I have to switch over to another operation
mode (i.e. synchronous) for higher speeds?
Thanks in advance.
Best regards,
Andre
Arnaud Mouiche schrieb:
> I was using redboot, configured to use the optimized memcpy (yes, it
> gives the choice at configuration time)
> on kernel side, I just hack memcpy_fromio to add a "weak" attribute, and
> rewrite it to directly use the linux optimized memcpy (shame on me for
> this "not suggested" methode, but speed was my goal)
>
> after that, performances are equal between bootloader and linux, and
> really near the one reached by a DMA access, which is also the
> performances we can calculate from FLASH time access configuration.
>
> arnaud
>
> Andre Puschmann a écrit :
>> Hi,
>>
>>
>>> I was faced with the same question in the past: bootloader NOR access
>>> was really much faster than the Linux one.
>>>
>> About how much faster? It really depends on the access method. I am
>> using U-Boot, and if I use the basic cp.b routine it's about the same
>> _slow_ speed. I tried to use the asm-optimised memcpy routine that the
>> kernel has. This is much faster, around 5MB/s.
>>
>>
>>> Yes, no DMA was used (but the same on bootloader, and anyway that
>>> doesn't impact the data rate, only the CPU load), but even worse, Linux
>>> code was using memcpy_fromio, which is a basic byte-by-byte copy loop in the
>>> default ARM implementation.
>>>
>> Yes, memcpy_fromio is quite slow. But using normal memcpy is not
>> suggested, only use writel()/readl() and memcpy_[from|to]io().
>>
>> I am not sure about the right _fast_ way to do such copies.
>>
>>
>> Regards,
>> Andre
>>
>>
* Re: flash read performance
From: Andre Puschmann @ 2008-11-04 8:30 UTC (permalink / raw)
To: linux-mtd
Hi,
Andre Puschmann schrieb:
> Unfortunately, it uses a legacy driver which only does a
> ioremap() but no ioremap_nocache().
Sorry, there was a typo in this statement.
It should be ioremap_cached().
Regards,
Andre
* Re: flash read performance
From: Jamie Lokier @ 2008-11-04 11:42 UTC (permalink / raw)
To: Andre Puschmann; +Cc: Arnaud Mouiche, linux-mtd
Andre Puschmann wrote:
> Hi,
>
> I spent some more time on this issue and investigated some mtd-maps
> drivers.
> The kernel I am using is a 2.6.21 that comes out of the gumstix
> svn-repo. Unfortunately, it uses a legacy driver which only does a
> ioremap() but no ioremap_nocache(). Patching the driver with this
> additional call boosts up transfers up to around 5.5MB/s, which
> is a fairly improvement.
> I will send a patch to the gumstix list. Users of newer kernel might
> not need this, as they use a newer driver (pxa2xx-flash.c) anyway.
I don't know much about this area, but will _writing_ to the flash
work reliably if ioremap_cached() is used?
-- Jamie
* Re: flash read performance
From: Andre Puschman @ 2008-11-04 14:31 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Arnaud Mouiche, linux-mtd
Jamie Lokier schrieb:
> I don't know much about this area, but will _writing_ to the flash
> work reliably if ioremap_cached() is used?
>
> -- Jamie
>
Good point. I was only into reading, so I totally forgot writing ;-)
I gave it a try; although it was terribly slow (only a few kB/s), it worked.
I just did a cp uImage /dev/mtd3. On the other hand, I never tried writing
with the old driver, so I don't know if this is faster.
I also did some more testing with my improved flash-timing parameters,
which yields read speeds of up to 18-19 MB/s, which is really fast
compared to the 1.3 MB/s at the beginning :-)
So for now, this is my result:
- cache and well-chosen flash timings have a great impact on (at least)
read performance
But I think transfer rates > 20 MB/s are possible ..
Anyway, with these results, booting the complete system in nearly (or
even less than) 2 s should be possible.
Regards,
Andre
* Re: flash read performance
From: Trent Piepho @ 2008-11-07 2:41 UTC (permalink / raw)
To: Andre Puschman; +Cc: MTD mailing list, Arnaud Mouiche, Jamie Lokier
On Tue, 4 Nov 2008, Andre Puschman wrote:
> Jamie Lokier schrieb:
>> I don't know much about this area, but will _writing_ to the flash
>> work reliably if ioremap_cached() is used?
>>
>> -- Jamie
>>
>
> Good point. I only was into reading and so I totally forgot writing ;-)
> I gave it a try, although it was terribly slow (only a few kb/s), it worked.
> I just did a cp uImage /dev/mtd3. On the other hand, I never tried writing
> with the old driver. So I don't know if this is faster.
I've found that writes do not work with caching enabled. When the CPU
writes to the flash and then reads it back, it gets back what it wrote.
That's not what is supposed to happen. For example, to program flash
word 'i' to the value 'val' using the Spansion/AMD method, you do this:
flash[0x555] = 0xaaaa;
flash[0x2aa] = 0x5555;
flash[0x555] = 0xa0a0;
flash[i] = val;
while(flash[i] != val); /* wait for it to finish */
After this, flash[0] should be whatever data was there before, not 0xf0f0.
The same goes for flash[0x555] and the rest; only flash[i] should be
modified. But if the flash is cached, the CPU will use the cached values
and think flash[0] is 0xf0f0 until the cache gets flushed.
> I also did some more testing with my improved flash-timing parameters,
> which yields to read speeds of up to 18-19MB/s, which is really fast
> compared
> to 1,3MB/s at the beginning :-)
My results, from a mpc8572 (powerpc) with a spansion s96gl064n flash chip on a
100 MHz bus.
Mapping Speed (MB/sec) (MB = 1048576 bytes)
un-cached and guarded 12.30
cached and guarded 14.24
cached and un-guarded 14.31
un-cached and un-guarded 14.66
I measured by reading the flash linearly from beginning to end, 32 bits
at a time. Since the flash is bigger than the cache, every read should
have come from the flash. If I just read the same 1k over and over, that
would obviously be much faster since it could come from the cache.
I'm just using the GPCM mode of the Freescale eLBC, which means I have to use
the same timings both for writes and reads. There are parts of the timing I
could make faster for reads, but then they would be too short for writes, and
vice versa. It also means I can't use the page burst mode, which would
speed up reads significantly.
> Anyway, with these results, booting the complete system in nearly (or
> even less than) 2s should be possible.
The biggest bootup delay I have now is waiting for the ethernet phy to get
online, which takes almost 3 seconds.
* Re: flash read performance
From: Jamie Lokier @ 2008-11-07 4:02 UTC (permalink / raw)
To: Trent Piepho; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list
Trent Piepho wrote:
> On Tue, 4 Nov 2008, Andre Puschman wrote:
> > Jamie Lokier schrieb:
> >> I don't know much about this area, but will _writing_ to the flash
> >> work reliably if ioremap_cached() is used?
> >>
> > Good point. I only was into reading and so I totally forgot writing ;-)
> > I gave it a try, although it was terribly slow (only a few kb/s), it worked.
> > I just did a cp uImage /dev/mtd3. On the other hand, I never tried writing
> > with the old driver. So I don't know if this is faster.
>
> I've found that writes do not work with caching enabled. When the CPU writes
> to the flash and then reads it back, it gets returned what it wrote.
Thanks, both.
Based on Andre's observation, I will soon try enabling cache for my
NOR, and see if it makes a difference to cold-cache read performance.
I don't expect it, but it's worth a try.
If that helps significantly, then I'll look at doing writes properly.
> That's not what is supposed to happen. For example, to program
> flash word 'i' to the value 'val' using the Spansion/AMD method, you
> do this:
>
> flash[0] = 0xf0f0;
> flash[0x555] = 0xaaaa;
> flash[0x2aa] = 0x5555;
> flash[0x555] = 0xa0a0;
> flash[i] = val;
> while(flash[i] != val); /* wait for it to finish */
>
> After this flash[0] should be whatever data was there before, not 0xf0f0.
> Same with flash[0x555] and the rest. Only flash[i] should be modified. But
> if flash is cached, the cpu will use the cached values and think flash[0] is
> 0xf0f0 until the cache gets flushed.
You might also find the write operation to be unreliable if the
caching mode is write-back rather than write-through.
Really, you should use an uncached mapping to write commands to the
flash, flush the cached mapping (for reads) when commands are written,
and prevent any access during the writes (MTD normally does this).
You could optimise by flushing only the cached read regions which are
affected by write and erase commands.
> > I also did some more testing with my improved flash-timing parameters,
> > which yields to read speeds of up to 18-19MB/s, which is really fast
> > compared
> > to 1,3MB/s at the beginning :-)
>
> My results, from a mpc8572 (powerpc) with a spansion s96gl064n flash
> chip on a 100 MHz bus.
>
> Mapping Speed (MB/sec) (MB = 1048576 bytes)
> un-cached and guarded 12.30
>> cached and guarded 14.24
> cached and un-guarded 14.31
> un-cached and un-guarded 14.66
>
> I measured by reading flash linearly from beginning to end 32-bits at a time.
>> Since the flash is bigger than the cache, every read should have come from the
> flash. If I just read the same 1k over and over that would obviously be much
> faster if it could come from the cache.
That's nice to see that cache helps cold-read performance too, not
just cached reads. Thanks :-)
> The biggest bootup delay I have now is waiting for the ethernet phy to get
> online, which takes almost 3 seconds.
Do you need to delay everything else for that, or can you parallelise?
-- Jamie
* Re: flash read performance
From: Trent Piepho @ 2008-11-07 5:36 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list
On Fri, 7 Nov 2008, Jamie Lokier wrote:
> Based on Andre's observation, I will soon try enabling cache for my
> NOR, and see if it makes a difference to cold-cache read performance.
> I don't expect it, but it's worth a try.
It's possible it could help by doing the reads back-to-back efficiently,
with no wasted cycles between them. I think it's necessary if you want
to benefit from page mode, but that's something I haven't tried yet.
> You might also find the write operation to be unreliable, if the
> caching mode is write-back rather than write-through.
Proper use of "sync" instructions, or whatever the arch uses to ensure
that writel() is strictly ordered, should fix that.
> Really, you should use an uncached mapping to write commands to the
> flash, flush the cached mapping (for reads) when commands are written,
> and prevent any access during the writes (this is in MTD normally).
> You could optimise by flushing only the cached read regions which are
> affected by write and erase commands.
Yes, that is probably the best. Most NOR flash writing is so slow that the
cache flushes shouldn't be too expensive.
>> Mapping Speed (MB/sec) (MB = 1048576 bytes)
>> un-cached and guarded 12.30
>> cached and guarded 14.24
>> cached and un-guarded 14.31
>> un-cached and un-guarded 14.66
>>
>> I measured by reading flash linearly from beginning to end 32-bits at a time.
>> Since the flash is bigger than the cache, every read should have come from the
>> flash. If I just read the same 1k over and over that would obviously be much
>> faster if it could come from the cache.
>
> That's nice to see that cache helps cold-read performance too, not
> just cached reads. Thanks :-)
Though if the mapping is not in guarded mode, turning cache on hurts
performance. That surprises me.
>> The biggest bootup delay I have now is waiting for the ethernet phy to get
>> online, which takes almost 3 seconds.
>
> Do you need to delay everything else for that, or can you parallelise?
It's parallel. I start the phy very early in the boot loader, and linux is
done booting and sitting in userspace waiting for it for a few hundred ms
before it's ready.
* Re: flash read performance
From: Jamie Lokier @ 2008-11-07 5:57 UTC (permalink / raw)
To: Trent Piepho; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list
Trent Piepho wrote:
> On Fri, 7 Nov 2008, Jamie Lokier wrote:
> > Based on Andre's observation, I will soon try enabling cache for my
> > NOR, and see if it makes a difference to cold-cache read performance.
> > I don't expect it, but it's worth a try.
>
> It's possible it could help by efficiently doing the reads back-to-back with
> no wasted cycles between them. I think it's necessary if you want to
> benefit from page mode, but that's something I haven't tried yet.
>
> > You might also find the write operation to be unreliable, if the
> > caching mode is write-back rather than write-through.
>
> Proper use of "sync" instructions, or whatever the arch uses to ensure that
> writel() is strictly ordered should fix that.
I think that won't work because strongly ordered writes don't
translate directly to bus transactions when the region is mapped with
ioremap_cached(). The flash wants exact bus transactions.
-- Jamie
* Re: flash read performance
From: Andre Puschmann @ 2008-11-07 9:47 UTC (permalink / raw)
To: linux-mtd
Hi,
Trent Piepho wrote:
> I've found that writes do not work with caching enabled. When the CPU writes
> to the flash and then reads it back, it gets returned what it wrote. That's
> not what is supposed to happen.
Sure, writes should be uncached and unbuffered. But I thought the MTD
layer handles this correctly, as there are two different ioremaps in the
driver:
map.virt = ioremap(..);
map.cached = ioremap_cached(..);
map.inval_cache = inval_cache_fct();
So, calling inval_cache_fct() just before any write operation and then
using the uncached mapping should do the trick, no? On the other hand, I
am not sure if the MTD layer really behaves like that. Can somebody
confirm this?
Anyway, I tried to write a cramfs image to a flash area previously
erased (in U-Boot). After that, I could successfully boot the system
using this cramfs as my root. So I would conclude that writes are OK.
Time is another point (cramfs_xip.bin is 1.6MB):
# time cp cramfs_xip.bin /dev/mtd3
real 4m 13.52s
user 0m 0.00s
sys 4m 13.03s
This is around 6.3 kB/s. Doing the same write in U-Boot with cp.b takes
about 26 s, so that is around 62 kB/s.
> I measured by reading flash linearly from beginning to end 32-bits at a time.
> Since the flash is bigger than the cache, ever read should have come from the
> flash. If I just read the same 1k over and over that would obviously be much
> faster if it could come from the cache.
I measured by reading from the mtd char device, which does not read the
same data over and over again.
I copied the whole partition into a ramdisk. mtd3 is 3 MB in size, so
this yields a read speed of 11.53 MB/s:
# time cp /dev/mtd3 /tmp/test
real 0m 0.26s
user 0m 0.01s
sys 0m 0.24s
> I'm just using the GPCM mode of the Freescale eLBC, which means I have to use
> the same timings both for writes and reads. There are parts of the timing I
> could make faster for reads, but then they would be too short for writes, and
> vice versa. It also means I can't use the page burst mode, which would
> speed up reads significantly.
I am not familiar with GPCM and eLBC, but it sounds about the same
here: same timings for writes and reads. I do use burst mode (4 words),
but this only applies to reads.
> The biggest bootup delay I have now is waiting for the ethernet phy to get
> online, which takes almost 3 seconds.
Same here, the ethernet phy takes so long. Here is what I do: I am
using a parallel init. One script is just for loading the ethernet
module, which is run next to the other scripts. So the 2 s is cheated,
because ethernet isn't really available at that point in time.
Regards,
Andre
* Re: flash read performance
From: Trent Piepho @ 2008-11-08 5:28 UTC (permalink / raw)
To: Andre Puschmann; +Cc: Jamie Lokier, MTD mailing list
On Fri, 7 Nov 2008, Andre Puschmann wrote:
> Trent Piepho wrote:
>> I've found that writes do not work with caching enabled. When the CPU writes
>> to the flash and then reads it back, it gets returned what it wrote. That's
>> not what is supposed to happen.
>
> Sure, writes should be uncached and unbuffered. But I thought the mtd
> layer handles this correctly as there are two different ioremap's in the
> driver:
>
> map.virt = ioremap(..);
> map.cached = ioremap_cached(..);
> map.inval_cache = inval_cache_fct();
It depends on what mapping driver you're using. It looks like only the
pxa2xx driver uses map.cached. The physmap or of_physmap drivers that I'm
using don't use it.
> So, calling inval_cache_fct() just before any write operation and then
> using the uncached mapping should do the trick, no? On the other hand, I
> am not sure if the mtd-layer really behaves like that. Can somebody
> confirm this?
From looking at the code I'd say you're right.
> I am not familiar with GPCM and eLBC, but it sounds about the same here.
> Same timings for writes and reads. But I use burst mode (4 words), but
> this only applies to reads.
I've switched from GPCM to UPM, which lets me use different timings for
read and write as well as use burst mode.
In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s. That's just
from slightly better timings because I could make them different for read
vs write. The big difference is cached and non-guarded reads, which went
to 44.79 MB/s from 14.24 MB/s. That boost is from using burst mode.
So the answer is yes, turning on cache can boost cold-cache performance, if
doing so lets you use page burst mode. It makes a huge difference in fact!
>> The biggest bootup delay I have now is waiting for the ethernet phy to get
>> online, which takes almost 3 seconds.
>
> Same here, the ethernet phy takes so long. Here is what I do: I am
> using a parallel init. One script is just for loading the ethernet
> module, which is run next to the other scripts. So the 2 s is cheated,
> because ethernet isn't really available at that point in time.
It might be the case that the ethernet module resets the PHY when it loads
and/or when the ethernet device is opened. That was a problem I was
having. The PHY would almost be done when the dhcp client would run and
open eth0, which would start the phy all over again.
* Re: flash read performance
From: Andre Puschmann @ 2008-11-11 13:28 UTC (permalink / raw)
To: linux-mtd
Hi Trent,
Trent Piepho schrieb:
>> map.virt = ioremap(..);
>> map.cached = ioremap_cached(..);
>> map.inval_cache = inval_cache_fct();
>
> It depends on what mapping driver you're using. It looks like only the
> pxa2xx driver uses map.cached. The physmap or of_physmap drivers that I'm
> using don't use it.
The flash map driver I am using is a custom one made for the gumstix board.
However, it is almost identical to the pxa2xx driver from newer kernels.
> I've switched from GPCM to UPM, which lets me use different timings for
> read and write as well as use burst mode.
>
> In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s. That's just
> from slightly better timings because I could make them different for read
> vs write. The big difference is cached and non-guarded reads, which went
> to 44.79 MB/s from 14.24 MB/s. That boost is from using burst mode.
Wow, this is great news. Btw, do you drive your flash in asynchronous
or in synchronous mode? Do you have an extra flash configuration
register that you need to modify in order to use burst mode?
My Intel NOR flash has an extra read configuration register (RCR).
However, for some reason I am not able to read/modify/write this register
successfully.
> It might be the case that the ethernet module resets the PHY when it
> loads
> and/or when the ethernet device is opened. That was a problem I was
> having. The PHY would almost be done when the dhcp client would run and
> open eth0, which would start the phy all over again.
Thanks for that hint. I'll defer this and look into it as a later task.
Regards,
Andre
* Re: flash read performance
2008-11-11 13:28 ` Andre Puschmann
@ 2008-11-15 2:02 ` Trent Piepho
0 siblings, 0 replies; 19+ messages in thread
From: Trent Piepho @ 2008-11-15 2:02 UTC (permalink / raw)
To: Andre Puschmann; +Cc: linux-mtd
On Tue, 11 Nov 2008, Andre Puschmann wrote:
> Trent Piepho schrieb:
>>> map.virt = ioremap(..);
>>> map.cached = ioremap_cached(..);
>>> map.inval_cache = inval_cache_fct();
>>
>> It depends on what mapping driver you're using. It looks like only the
>> pxa2xx driver uses map.cached. The physmap or of_physmap drivers that I'm
>> using don't use it.
>
> The flash map driver I am using is a custom one made for the gumstix board.
> However, it is almost identical to the pxa2xx driver from newer kernels.
I added support to physmap_of for using map.cached on ppc32, seems to be
working so far.
But it turns out this doesn't work for XIP. The flash drivers that support XIP
implement a ->point() method, which filesystems like AXFS or cramfs+xip use to
mmap the flash. But these just return pointers to the uncached mapping.
>> In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s. That's just
>> from slightly better timings because I could make them different for read
>> vs write. The big difference is cached and non-guarded reads, which went
>> to 44.79 MB/s from 14.24 MB/s. That boost is from using burst mode.
>
> Wow, this is great news. Btw, do you drive your flash in asynchronous
> or in synchronous mode? Do you have an extra flash configuration
> register that you need to modify in order to use burst mode?
> My Intel NOR flash has an extra read configuration register (RCR).
> However, for some reason I am not able to read/modify/write this register
> successfully.
In asynchronous mode. I didn't have to program anything special in the flash
chip, just program the localbus controller to use page burst transfers.
end of thread, other threads:[~2008-11-15 2:05 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-28 10:14 flash read performance Andre Puschmann
2008-10-29 11:42 ` Josh Boyer
2008-10-29 12:03 ` Jamie Lokier
2008-10-29 15:52 ` Andre Puschmann
2008-10-30 8:33 ` Arnaud Mouiche
2008-10-30 9:52 ` Andre Puschmann
2008-10-30 10:06 ` Arnaud Mouiche
2008-11-03 14:23 ` Andre Puschmann
2008-11-04 8:30 ` Andre Puschmann
2008-11-04 11:42 ` Jamie Lokier
2008-11-04 14:31 ` Andre Puschman
2008-11-07 2:41 ` Trent Piepho
2008-11-07 4:02 ` Jamie Lokier
2008-11-07 5:36 ` Trent Piepho
2008-11-07 5:57 ` Jamie Lokier
2008-11-07 9:47 ` Andre Puschmann
2008-11-08 5:28 ` Trent Piepho
2008-11-11 13:28 ` Andre Puschmann
2008-11-15 2:02 ` Trent Piepho