public inbox for linux-mtd@lists.infradead.org
* flash read performance
@ 2008-10-28 10:14 Andre Puschmann
  2008-10-29 11:42 ` Josh Boyer
  0 siblings, 1 reply; 19+ messages in thread
From: Andre Puschmann @ 2008-10-28 10:14 UTC (permalink / raw)
  To: linux-mtd

Hi list,

I am currently trying to improve the flash read performance of my 
platform. It's a gumstix verdex board with a pxa270 running at 400MHz. 
My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's 
operating in the _normal_ asynchronous mode. In my opinion the read 
performance is very poor, only around 1.2 to 1.4 MB/s depending on the 
blocksize. I think it should be possible to get much higher transfer rates.

In Linux, I ran my tests with dd like this (copy 10MB):
time dd if=/dev/mtd5 of=/dev/null bs=16k count=640
640+0 records in
640+0 records out
real    0m 7.17s
user    0m 0.00s
sys     0m 7.17s


Running top in another console shows that the CPU load is very high
during the copy. I am not sure whether the system is doing some sort of busy
waiting. However, it should be possible to do the copy without such a high load.

Mem: 17684K used, 45144K free, 0K shrd, 0K buff, 11164K cached
CPU:   0% usr 100% sys   0% nice   0% idle   0% io   0% irq   0% softirq
Load average: 0.10 0.17 0.09
  PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
  259   258 root     R     1100   2%  95% dd if /dev/mtd5 of /dev/null 
bs 16k co


I guess a number of people are using a similar or comparable setup, so some
kind of user benchmark would be nice.
I am sure this needs some more investigation, but any comment or hint is
more than welcome.

Best regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-28 10:14 flash read performance Andre Puschmann
@ 2008-10-29 11:42 ` Josh Boyer
  2008-10-29 12:03   ` Jamie Lokier
  2008-10-29 15:52   ` Andre Puschmann
  0 siblings, 2 replies; 19+ messages in thread
From: Josh Boyer @ 2008-10-29 11:42 UTC (permalink / raw)
  To: Andre Puschmann; +Cc: linux-mtd

On Tue, Oct 28, 2008 at 11:14:05AM +0100, Andre Puschmann wrote:
>Hi list,
>
>I am currently trying to improve the flash read performance of my 
>platform. It's a gumstix verdex board with a pxa270 running at 400MHz. 
>My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's 
>operating in the _normal_ asynchronous mode. In my opinion the read 
>performance is very poor, only around 1.2 to 1.4 MB/s depending on the 
>blocksize. I think it should be possible to get much higher transfer rates.

Why do you think that?

>In Linux, I ran my tests with dd like this (copy 10MB):
>time dd if=/dev/mtd5 of=/dev/null bs=16k count=640
>640+0 records in
>640+0 records out
>real    0m 7.17s
>user    0m 0.00s
>sys     0m 7.17s
>
>
>Running top in another console brings up, that the CPU load is very high
>during copy. I am not sure if the system is doing some sort of busy 
>waiting or something like that? However, it should be possible to do a 
>copy without having such a high load.

Why do you think that?  The chip drivers don't do DMA, so all I/O goes
through the CPU.

josh

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-29 11:42 ` Josh Boyer
@ 2008-10-29 12:03   ` Jamie Lokier
  2008-10-29 15:52   ` Andre Puschmann
  1 sibling, 0 replies; 19+ messages in thread
From: Jamie Lokier @ 2008-10-29 12:03 UTC (permalink / raw)
  To: Josh Boyer; +Cc: Andre Puschmann, linux-mtd

Josh Boyer wrote:
> On Tue, Oct 28, 2008 at 11:14:05AM +0100, Andre Puschmann wrote:
> >Hi list,
> >
> >I am currently trying to improve the flash read performance of my 
> >platform. It's a gumstix verdex board with a pxa270 running at 400MHz. 
> >My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's 
> >operating in the _normal_ asynchronous mode. In my opinion the read 
> >performance is very poor, only around 1.2 to 1.4 MB/s depending on the 
> >blocksize. I think it should be possible to get much higher transfer rates.
> 
> Why do you think that?

Take a look at:

http://marc.info/?l=linux-embedded&m=122125638419881&w=2
http://marc.info/?l=linux-embedded&m=122149685932525&w=2
http://marc.info/?l=linux-embedded&m=122185227511786&w=2

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-29 11:42 ` Josh Boyer
  2008-10-29 12:03   ` Jamie Lokier
@ 2008-10-29 15:52   ` Andre Puschmann
  2008-10-30  8:33     ` Arnaud Mouiche
  1 sibling, 1 reply; 19+ messages in thread
From: Andre Puschmann @ 2008-10-29 15:52 UTC (permalink / raw)
  To: linux-mtd

Josh Boyer wrote:
>> In my opinion the read 
>> performance is very poor, only around 1.2 to 1.4 MB/s depending on the 
>> blocksize. I think it should be possible to get much higher transfer rates.
> 
> Why do you think that?

I guess there is something wrong with the timing parameters and/or the 
way the CPU core talks to the flash controller, which results in long 
wait states.
But at least to my understanding, these transfer rates have nothing to 
do with _high speed NOR flashes_ :-)


> Why do you think that?  The chip drivers don't do DMA, so all I/O goes
> through the CPU.

Yes, DMA is not used. However, the CPU should be strong enough to do 
this transfer faster. On the other hand, my understanding is that DMA 
does not bring speed improvements in all cases. It boosts memory 
transfers without adding extra overhead to the CPU. But in this case, 
copying data is the only task.


Regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-29 15:52   ` Andre Puschmann
@ 2008-10-30  8:33     ` Arnaud Mouiche
  2008-10-30  9:52       ` Andre Puschmann
  0 siblings, 1 reply; 19+ messages in thread
From: Arnaud Mouiche @ 2008-10-30  8:33 UTC (permalink / raw)
  To: linux-mtd

Hi,

I was faced with the same question in the past: bootloader NOR access 
was really much faster than the Linux one.
Yes, no DMA was used (but the same in the bootloader; and anyway that 
doesn't impact the data rate, only the CPU load). Even worse, the Linux 
code was using memcpy_fromio, which is a basic byte-by-byte copy loop in 
the default ARM implementation.

Maybe your issues are the same...

Regards,
arnaud

Andre Puschmann wrote:
> Josh Boyer schrieb:
>   
>>> In my opinion the read 
>>> performance is very poor, only around 1.2 to 1.4 MB/s depending on the 
>>> blocksize. I think it should be possible to get much higher transfer rates.
>>>       
>> Why do you think that?
>>     
>
> I guess there is something wrong with the timing parameters and/or the 
> way the CPU core speaks to the flash controller, which results in long 
> wait-states.
> But at least to my understanding, these transfer rates have nothing to 
> do with _high speed NOR flashes_ :-)
>
>
>   
>> Why do you think that?  The chip drivers don't do DMA, so all I/O goes
>> through the CPU.
>>     
>
> Yes, DMA is not used. However, the CPU should be strong enough to do 
> this transfer faster. On the other hand, my understanding is, DMA brings 
> no speed improvements in all cases. It boosts memory transfers without 
> adding an extra overhead to the CPU. But in this case, copy data is the 
> only task.
>
>
> Regards,
> Andre
>
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
>
>   

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-30  8:33     ` Arnaud Mouiche
@ 2008-10-30  9:52       ` Andre Puschmann
  2008-10-30 10:06         ` Arnaud Mouiche
  0 siblings, 1 reply; 19+ messages in thread
From: Andre Puschmann @ 2008-10-30  9:52 UTC (permalink / raw)
  To: linux-mtd

Hi,

> I was faced with the same wondering in the past : bootloader NOR access 
> was really much faster that Linux one.

About how much faster? It really depends on the access method. I am
using u-boot, and if I use the basic cp.b routine it's about the same
_slow_ speed. I tried the asm-optimised memcpy routine that the 
kernel has. This is much faster, around 5MB/s.

> Yes, no DMA was used (but the same on bootloader, and anyway that 
> doesn't impact the data rate, only the CPU load), but even worse, Linux 
> code was using memcpy_fromio which a basic byte by byte loop copy in the 
> default ARM implementation.

Yes, memcpy_fromio is quite slow. But using normal memcpy is not 
suggested; one should only use writel()/readl() and memcpy_[from|to]io().

I am not sure about the right _fast_ way to do such copies.


Regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-30  9:52       ` Andre Puschmann
@ 2008-10-30 10:06         ` Arnaud Mouiche
  2008-11-03 14:23           ` Andre Puschmann
  0 siblings, 1 reply; 19+ messages in thread
From: Arnaud Mouiche @ 2008-10-30 10:06 UTC (permalink / raw)
  To: Andre Puschmann; +Cc: linux-mtd

I was using RedBoot, configured to use the optimized memcpy (yes, it 
gives you the choice at configuration time).
On the kernel side, I just hacked memcpy_fromio to add a "weak" attribute 
and rewrote it to directly use the optimized Linux memcpy (shame on me for 
this "not suggested" method, but speed was my goal).

After that, performance is equal between bootloader and Linux, and 
really close to that reached by DMA access, which is also the 
performance we can calculate from the flash access-timing configuration.

arnaud

Andre Puschmann wrote:
> Hi,
>
>   
>> I was faced with the same wondering in the past : bootloader NOR access 
>> was really much faster that Linux one.
>>     
>
> About how much faster? It really depends on the access method. I am
> using u-boot and if I use the basic cp.b routine its about the same
> _slow_ speed. I tried to use the asm-optimised memcpy routine that the 
> kernel has. This is much faster, around 5MB/s.
>
>   
>> Yes, no DMA was used (but the same on bootloader, and anyway that 
>> doesn't impact the data rate, only the CPU load), but even worse, Linux 
>> code was using memcpy_fromio which a basic byte by byte loop copy in the 
>> default ARM implementation.
>>     
>
> Yes, memcpy_fromio is quite slow. But using normal memcpy is not 
> suggested, only use writel()/readl() and memcpy_[from|to]io().
>
> I am not sure about the right _fast_ way to to such copies.
>
>
> Regards,
> Andre
>
>
>
>   

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-10-30 10:06         ` Arnaud Mouiche
@ 2008-11-03 14:23           ` Andre Puschmann
  2008-11-04  8:30             ` Andre Puschmann
  2008-11-04 11:42             ` Jamie Lokier
  0 siblings, 2 replies; 19+ messages in thread
From: Andre Puschmann @ 2008-11-03 14:23 UTC (permalink / raw)
  To: linux-mtd; +Cc: linux-mtd

Hi,

I spent some more time on this issue and investigated some mtd-maps
drivers.
The kernel I am using is a 2.6.21 that comes out of the gumstix 
svn repo. Unfortunately, it uses a legacy driver which only does an
ioremap() but no ioremap_nocache(). Patching the driver with this
additional call boosts transfers up to around 5.5MB/s, which
is a fair improvement.
I will send a patch to the gumstix list. Users of newer kernels might
not need this, as they use a newer driver (pxa2xx-flash.c) anyway.

But I am wondering if things can still go faster?!
Jamie, do you have some information about the speed I can expect 
theoretically? Or do I have to switch over to another operation
mode (i.e. synchronous) for higher speeds?

Thanks in advance.

Best regards,
Andre



Arnaud Mouiche wrote:
> I was using redboot, configured to use the optimized memcpy (yes, it 
> gives the choice at configuration time)
> on kernel side, I just hack memcpy_fromio to add a "weak" attribute, and 
> rewrite it to directly use the linux optimized memcpy (shame on me for 
> this "not suggested" methode, but speed was my goal)
> 
> after that, performances are equal between bootloader and linux, and 
> really near the one reached by a DMA access, which is also the 
> performances we can calculate from FLASH time access configuration.
> 
> arnaud
> 
> Andre Puschmann a écrit :
>> Hi,
>>
>>   
>>> I was faced with the same wondering in the past : bootloader NOR access 
>>> was really much faster that Linux one.
>>>     
>> About how much faster? It really depends on the access method. I am
>> using u-boot and if I use the basic cp.b routine its about the same
>> _slow_ speed. I tried to use the asm-optimised memcpy routine that the 
>> kernel has. This is much faster, around 5MB/s.
>>
>>   
>>> Yes, no DMA was used (but the same on bootloader, and anyway that 
>>> doesn't impact the data rate, only the CPU load), but even worse, Linux 
>>> code was using memcpy_fromio which a basic byte by byte loop copy in the 
>>> default ARM implementation.
>>>     
>> Yes, memcpy_fromio is quite slow. But using normal memcpy is not 
>> suggested, only use writel()/readl() and memcpy_[from|to]io().
>>
>> I am not sure about the right _fast_ way to to such copies.
>>
>>
>> Regards,
>> Andre
>>
>>
>>
>>   
> 
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-03 14:23           ` Andre Puschmann
@ 2008-11-04  8:30             ` Andre Puschmann
  2008-11-04 11:42             ` Jamie Lokier
  1 sibling, 0 replies; 19+ messages in thread
From: Andre Puschmann @ 2008-11-04  8:30 UTC (permalink / raw)
  To: linux-mtd

Hi,

Andre Puschmann wrote:
> Unfortunately, it uses a legacy driver which only does a
> ioremap() but no ioremap_nocache().

Sorry, there was a typo in this statement.
It should be ioremap_cached().


Regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-03 14:23           ` Andre Puschmann
  2008-11-04  8:30             ` Andre Puschmann
@ 2008-11-04 11:42             ` Jamie Lokier
  2008-11-04 14:31               ` Andre Puschman
  1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2008-11-04 11:42 UTC (permalink / raw)
  To: Andre Puschmann; +Cc: Arnaud Mouiche, linux-mtd

Andre Puschmann wrote:
> Hi,
> 
> I spent some more time on this issue and investigated some mtd-maps
> drivers.
> The kernel I am using is a 2.6.21 that comes out of the gumstix 
> svn-repo. Unfortunately, it uses a legacy driver which only does a
> ioremap() but no ioremap_nocache(). Patching the driver with this
> additional call boosts up transfers up to around 5.5MB/s, which
> is a fairly improvement.
> I will send a patch to the gumstix list. Users of newer kernel might
> not need this, as they use a newer driver (pxa2xx-flash.c) anyway.

I don't know much about this area, but will _writing_ to the flash
work reliably if ioremap_cached() is used?

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-04 11:42             ` Jamie Lokier
@ 2008-11-04 14:31               ` Andre Puschman
  2008-11-07  2:41                 ` Trent Piepho
  0 siblings, 1 reply; 19+ messages in thread
From: Andre Puschman @ 2008-11-04 14:31 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Arnaud Mouiche, linux-mtd

Jamie Lokier wrote:
> I don't know much about this area, but will _writing_ to the flash
> work reliably if ioremap_cached() is used?
>
> -- Jamie
>   

Good point. I was only looking at reading, so I totally forgot about writing ;-)
I gave it a try, and although it was terribly slow (only a few kB/s), it worked.
I just did a cp uImage /dev/mtd3. On the other hand, I never tried writing
with the old driver, so I don't know whether this is faster.

I also did some more testing with my improved flash-timing parameters,
which yields read speeds of up to 18-19MB/s. That is really fast compared
to the 1.3MB/s at the beginning :-)

So for now, this is my result:
- cache and well-chosen flash timings have a great impact on (at least) 
read performance

But I think transfer rates > 20MB/s are possible ...

Anyway, with these results, booting the complete system in nearly (or 
even less than) 2s should be possible.

Regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-04 14:31               ` Andre Puschman
@ 2008-11-07  2:41                 ` Trent Piepho
  2008-11-07  4:02                   ` Jamie Lokier
  2008-11-07  9:47                   ` Andre Puschmann
  0 siblings, 2 replies; 19+ messages in thread
From: Trent Piepho @ 2008-11-07  2:41 UTC (permalink / raw)
  To: Andre Puschman; +Cc: MTD mailing list, Arnaud Mouiche, Jamie Lokier

On Tue, 4 Nov 2008, Andre Puschman wrote:
> Jamie Lokier schrieb:
>> I don't know much about this area, but will _writing_ to the flash
>> work reliably if ioremap_cached() is used?
>>
>> -- Jamie
>>
>
> Good point. I only was into reading and so I totally forgot writing ;-)
> I gave it a try, although it was terribly slow (only a few kb/s), it worked.
> I just did a cp uImage /dev/mtd3. On the other hand, I never tried writing
> with the old driver. So I don't know if this is faster.

I've found that writes do not work with caching enabled.  When the CPU writes
to the flash and then reads it back, it gets back what it wrote.  That's
not what is supposed to happen.  For example, to program flash word 'i' to
the value 'val' using the Spansion/AMD method, you do this:

flash[0] = 0xf0f0;
flash[0x555] = 0xaaaa;
flash[0x2aa] = 0x5555;
flash[0x555] = 0xa0a0;
flash[i] = val;
while(flash[i] != val); /* wait for it to finish */

After this, flash[0] should be whatever data was there before, not 0xf0f0. 
Same with flash[0x555] and the rest.  Only flash[i] should be modified.  But
if the flash is cached, the CPU will use the cached values and think flash[0]
is 0xf0f0 until the cache gets flushed.

> I also did some more testing with my improved flash-timing parameters,
> which yields to read speeds of up to 18-19MB/s, which is really fast
> compared
> to 1,3MB/s at the beginning :-)

My results, from a mpc8572 (powerpc) with a spansion s96gl064n flash chip on a
100 MHz bus.

Mapping				Speed (MB/sec) (MB = 1048576 bytes)
un-cached and guarded		12.30
cached and guarded		14.24
cached and un-guarded		14.31
un-cached and un-guarded	14.66

I measured by reading flash linearly from beginning to end, 32 bits at a time. 
Since the flash is bigger than the cache, every read should have come from the
flash.  If I just read the same 1k over and over, that would obviously be much
faster because it could come from the cache.

I'm just using the GPCM mode of the Freescale eLBC, which means I have to use
the same timings both for writes and reads.  There are parts of the timing I
could make faster for reads, but then they would be too short for writes, and
vice versa.  It also means I can't use the page burst mode, which would
speed up reads significantly.

> Anyway, with these results, booting the complete system in nearly (or
> even less than) 2s should be possible.

The biggest bootup delay I have now is waiting for the ethernet phy to get
online, which takes almost 3 seconds.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-07  2:41                 ` Trent Piepho
@ 2008-11-07  4:02                   ` Jamie Lokier
  2008-11-07  5:36                     ` Trent Piepho
  2008-11-07  9:47                   ` Andre Puschmann
  1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2008-11-07  4:02 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list

Trent Piepho wrote:
> On Tue, 4 Nov 2008, Andre Puschman wrote:
> > Jamie Lokier schrieb:
> >> I don't know much about this area, but will _writing_ to the flash
> >> work reliably if ioremap_cached() is used?
> >>
> > Good point. I only was into reading and so I totally forgot writing ;-)
> > I gave it a try, although it was terribly slow (only a few kb/s), it worked.
> > I just did a cp uImage /dev/mtd3. On the other hand, I never tried writing
> > with the old driver. So I don't know if this is faster.
> 
> I've found that writes do not work with caching enabled.  When the CPU writes
> to the flash and then reads it back, it gets returned what it wrote.

Thanks, both.

Based on Andre's observation, I will soon try enabling cache for my
NOR, and see if it makes a difference to cold-cache read performance.
I don't expect it, but it's worth a try.

If that helps significantly, then I'll look at doing writes properly.

> That's not what is supposed to happen.  For example, to program
> flash word 'i' to the value 'val' using the Spansion/AMD method, you
> do this:
> 
> flash[0] = 0xf0f0;
> flash[0x555] = 0xaaaa;
> flash[0x2aa] = 0x5555;
> flash[0x555] = 0xa0a0;
> flash[i] = val;
> while(flash[i] != val); /* wait for it to finish */
> 
> After this flash[0] should be whatever data was there before, not 0xf0f0. 
> Same with flash[0x555] and the rest.  Only flash[i] should be modified.  But
> if flash is cached, the cpu will use the cached values and think flash[0] is
> 0xf0f0 until the cache gets flushed.

You might also find the write operation to be unreliable if the
caching mode is write-back rather than write-through.

Really, you should use an uncached mapping to write commands to the
flash, flush the cached mapping (used for reads) whenever commands are
written, and prevent any access during the writes (MTD normally handles
this). You could optimise by flushing only the cached read regions
affected by write and erase commands.

> > I also did some more testing with my improved flash-timing parameters,
> > which yields to read speeds of up to 18-19MB/s, which is really fast
> > compared
> > to 1,3MB/s at the beginning :-)
> 
> My results, from a mpc8572 (powerpc) with a spansion s96gl064n flash
> chip on a 100 MHz bus.
> 
> Mapping				Speed (MB/sec) (MB = 1048576 bytes)
> un-cached and guarded		12.30
> cached and gaurded		14.24
> cached and un-guarded		14.31
> un-cached and un-guarded	14.66
> 
> I measured by reading flash linearly from beginning to end 32-bits at a time. 
> Since the flash is bigger than the cache, every read should have come from the
> flash.  If I just read the same 1k over and over that would obviously be much
> faster if it could come from the cache.

It's nice to see that cache helps cold-read performance too, not
just repeated cached reads.  Thanks :-)

> The biggest bootup delay I have now is waiting for the ethernet phy to get
> online, which takes almost 3 seconds.

Do you need to delay everything else for that, or can you parallelise?

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-07  4:02                   ` Jamie Lokier
@ 2008-11-07  5:36                     ` Trent Piepho
  2008-11-07  5:57                       ` Jamie Lokier
  0 siblings, 1 reply; 19+ messages in thread
From: Trent Piepho @ 2008-11-07  5:36 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list

On Fri, 7 Nov 2008, Jamie Lokier wrote:
> Based on Andre's observation, I will soon try enabling cache for my
> NOR, and see if it makes a difference to cold-cache read performance.
> I don't expect it, but it's worth a try.

It's possible it could help by doing the reads back-to-back efficiently,
with no wasted cycles between them.  I think it's necessary if you want to
benefit from page mode, but that's something I haven't tried yet.

> You might also find the write operation to be unreliable, if the
> caching mode is write-back rather than write-through.

Proper use of "sync" instructions, or whatever the arch uses to ensure that
writel() is strictly ordered, should fix that.

> Really, you should use an uncached mapping to write commands to the
> flash, flush the cached mapping (for reads) when commands are written,
> and prevent any access during the writes (this is in MTD normally).
> You could optimise by flushing only the cached read regions which are
> affected by write and erase commands.

Yes, that is probably the best.  Most NOR flash writing is so slow that the
cache flushes shouldn't be too expensive.

>> Mapping				Speed (MB/sec) (MB = 1048576 bytes)
>> un-cached and guarded		12.30
>> cached and guarded		14.24
>> cached and un-guarded		14.31
>> un-cached and un-guarded	14.66
>>
>> I measured by reading flash linearly from beginning to end 32-bits at a time.
>> Since the flash is bigger than the cache, every read should have come from the
>> flash.  If I just read the same 1k over and over that would obviously be much
>> faster if it could come from the cache.
>
> That's nice to see that cache helps cold-read performance too, not
> just cached reads.  Thanks :-)

Though if the mapping is not in guarded mode, turning cache on hurts
performance.  That surprises me.

>> The biggest bootup delay I have now is waiting for the ethernet phy to get
>> online, which takes almost 3 seconds.
>
> Do you need to delay everything else for that, or can you parallelise?

It's parallel.  I start the phy very early in the boot loader, and linux is
done booting and sitting in userspace waiting for it for a few hundred ms
before it's ready.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-07  5:36                     ` Trent Piepho
@ 2008-11-07  5:57                       ` Jamie Lokier
  0 siblings, 0 replies; 19+ messages in thread
From: Jamie Lokier @ 2008-11-07  5:57 UTC (permalink / raw)
  To: Trent Piepho; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list

Trent Piepho wrote:
> On Fri, 7 Nov 2008, Jamie Lokier wrote:
> > Based on Andre's observation, I will soon try enabling cache for my
> > NOR, and see if it makes a difference to cold-cache read performance.
> > I don't expect it, but it's worth a try.
> 
> It's possible it could help by efficiently doing the reads back-to-back with
> no wasted cycles between them.  I think it's necessary if you want to
> benefit from page mode, but that's something I haven't tried yet.
> 
> > You might also find the write operation to be unreliable, if the
> > caching mode is write-back rather than write-through.
> 
> Proper use of "sync" instructions, or whatever the arch uses to ensure that
> writel() is strictly ordered should fix that.

I think that won't work because strongly ordered writes don't
translate directly to bus transactions when the region is mapped with
ioremap_cached().  The flash wants exact bus transactions.

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-07  2:41                 ` Trent Piepho
  2008-11-07  4:02                   ` Jamie Lokier
@ 2008-11-07  9:47                   ` Andre Puschmann
  2008-11-08  5:28                     ` Trent Piepho
  1 sibling, 1 reply; 19+ messages in thread
From: Andre Puschmann @ 2008-11-07  9:47 UTC (permalink / raw)
  To: linux-mtd

Hi,

Trent Piepho wrote:
> I've found that writes do not work with caching enabled.  When the CPU writes
> to the flash and then reads it back, it gets returned what it wrote.  That's
> not what is supposed to happen.  

Sure, writes should be uncached and unbuffered. But I thought the mtd 
layer handles this correctly, as there are two different ioremaps in the 
driver:

map.virt = ioremap(..);
map.cached = ioremap_cached(..);
map.inval_cache = inval_cache_fct();

So, calling inval_cache_fct() just before any write operation and then 
using the uncached mapping should do the trick, no? On the other hand, I 
am not sure if the mtd layer really behaves like that. Can somebody 
confirm this?
Anyway, I tried to write a cramfs image to a previously (in u-boot) 
erased flash area. After that, I could successfully boot the system 
using this cramfs as my root. So I would conclude that writes are OK.
Time is another point (cramfs_xip.bin is 1.6MB):

# time cp cramfs_xip.bin /dev/mtd3
real    4m 13.52s
user    0m 0.00s
sys     4m 13.03s

This is around 6.3kB/s. Doing the same write in u-boot with cp.b takes 
about 26sec, so that is around 62kB/s.


> I measured by reading flash linearly from beginning to end 32-bits at a time. 
> Since the flash is bigger than the cache, ever read should have come from the
> flash.  If I just read the same 1k over and over that would obviously be much
> faster if it could come from the cache.

I measured by reading from the mtd char device, which does not read the 
same data over and over again.
I copied the whole partition into a ramdisk. mtd3 is 3MB in size, so this 
yields a read speed of 11.53MB/s:

# time cp /dev/mtd3 /tmp/test
real    0m 0.26s
user    0m 0.01s
sys     0m 0.24s


> I'm just using the GPCM mode of the Freescale eLBC, which means I have to use
> the same timings both for writes and reads.  There are parts of the timing I
> could make faster for reads, but then they would be too short for writes, and
> vice versa.  It also means I can't use the page burst mode, which would
> speed up reads significantly.

I am not familiar with GPCM and eLBC, but it sounds about the same here: 
same timings for writes and reads. I do use burst mode (4 words), though 
this only applies to reads.

> The biggest bootup delay I have now is waiting for the ethernet phy to get
> online, which takes almost 3 seconds.

Same here, the ethernet phy takes sooo long. Here is what I do: I am using a 
parallel init. One script is just for loading the ethernet module, which 
is run alongside the other scripts. So the 2s is cheated a bit, because 
ethernet isn't really available at that point in time.

Regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-07  9:47                   ` Andre Puschmann
@ 2008-11-08  5:28                     ` Trent Piepho
  2008-11-11 13:28                       ` Andre Puschmann
  0 siblings, 1 reply; 19+ messages in thread
From: Trent Piepho @ 2008-11-08  5:28 UTC (permalink / raw)
  To: Andre Puschmann; +Cc: Jamie Lokier, MTD mailing list

On Fri, 7 Nov 2008, Andre Puschmann wrote:
> Trent Piepho wrote:
>> I've found that writes do not work with caching enabled.  When the CPU writes
>> to the flash and then reads it back, it gets returned what it wrote.  That's
>> not what is supposed to happen.
>
> Sure, writes should be uncached and unbuffered. But I thought the mtd
> layer handles this correctly as there are two different ioremap's in the
> driver:
>
> map.virt = ioremap(..);
> map.cached = ioremap_cached(..);
> map.inval_cache = inval_cache_fct();

It depends on what mapping driver you're using.  It looks like only the
pxa2xx driver uses map.cached.  The physmap or of_physmap drivers that I'm
using don't use it.

> So, calling inval_cache_fct() just before any write operation and then
> using the uncached mapping should do the trick, no? On the other hand, I
> am not sure if the mtd-layer really behaves like that. Can somebody
> confirm this?

From looking at the code I'd say you're right.

> I am not familiar with GPCM and eLBC, but it sounds about the same here.
> Same timings for writes and reads. But I use burst mode (4 words), but
> this only applies to reads.

I've switched from GPCM to UPM, which lets me use different timings for
read and write as well as use burst mode.

In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s.  That's just
from slightly better timings because I could make them different for read
vs write.  The big difference is cached and non-guarded reads, which went
to 44.79 MB/s from 14.24 MB/s.  That boost is from using burst mode.

So the answer is yes, turning on cache can boost cold-cache performance, if
doing so lets you use page burst mode.  It makes a huge difference in fact!

>> The biggest bootup delay I have now is waiting for the ethernet phy to get
>> online, which takes almost 3 seconds.
>
> Same here, ethernet phy takes sooo long. Here is what I do: I am using a
> parallel init. One script is just for loading the ethernet module, which
> is done next to the other scripts. So the 2s figure is cheating a bit,
> because ethernet isn't really available at that point in time.

It might be the case that the ethernet module resets the PHY when it loads
and/or when the ethernet device is opened.  That was a problem I was
having.  The PHY would almost be done when the dhcp client would run and
open eth0, which would start the phy all over again.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-08  5:28                     ` Trent Piepho
@ 2008-11-11 13:28                       ` Andre Puschmann
  2008-11-15  2:02                         ` Trent Piepho
  0 siblings, 1 reply; 19+ messages in thread
From: Andre Puschmann @ 2008-11-11 13:28 UTC (permalink / raw)
  To: linux-mtd

Hi Trent,


Trent Piepho schrieb:
>> map.virt = ioremap(..);
>> map.cached = ioremap_cached(..);
>> map.inval_cache = inval_cache_fct();
> 
> It depends on what mapping driver you're using.  It looks like only the
> pxa2xx driver uses map.cached.  The physmap or of_physmap drivers that I'm
> using don't use it.

The flash map driver I am using is a custom one made for the gumstix 
board. However, it is almost identical to the pxa2xx driver from newer 
kernels.

> I've switched from GPCM to UPM, which lets me use different timings for
> read and write as well as use burst mode.
> 
> In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s.  That's just
> from slightly better timings because I could make them different for read
> vs write.  The big difference is cached and non-guarded reads, which went
> to 44.79 MB/s from 14.24 MB/s.  That boost is from using burst mode.

Wow, this is great news. Btw, do you drive your flash in asynchronous 
or in synchronous mode? Do you have an extra flash configuration 
register that you need to modify in order to use the burst mode?
My Intel NOR flash has an extra read configuration register (RCR). 
However, for some reason I am not able to read/modify/write this 
register successfully.

> It might be the case that the ethernet module resets the PHY when it loads
> and/or when the ethernet device is opened.  That was a problem I was
> having.  The PHY would almost be done when the dhcp client would run and
> open eth0, which would start the phy all over again.

Thanks for that hint. I'll defer this to a later task.

Regards,
Andre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: flash read performance
  2008-11-11 13:28                       ` Andre Puschmann
@ 2008-11-15  2:02                         ` Trent Piepho
  0 siblings, 0 replies; 19+ messages in thread
From: Trent Piepho @ 2008-11-15  2:02 UTC (permalink / raw)
  To: Andre Puschmann; +Cc: linux-mtd

On Tue, 11 Nov 2008, Andre Puschmann wrote:
> Trent Piepho schrieb:
>>> map.virt = ioremap(..);
>>> map.cached = ioremap_cached(..);
>>> map.inval_cache = inval_cache_fct();
>>
>> It depends on what mapping driver you're using.  It looks like only the
>> pxa2xx driver uses map.cached.  The physmap or of_physmap drivers that I'm
>> using don't use it.
>
> The flash map driver I am using is a custom one made for the gumstix board.
> However, it is almost identical to the pxa2xx driver from newer kernels.

I added support to physmap_of for using map.cached on ppc32, seems to be
working so far.

But, it turns out it doesn't work for XIP.  The flash drivers that support XIP
implement a ->point() method that filesystems like AXFS or cramfs+xip use to
mmap the flash.  But these just return pointers to the uncached mapping.

>> In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s.  That's just
>> from slightly better timings because I could make them different for read
>> vs write.  The big difference is cached and non-guarded reads, which went
>> to 44.79 MB/s from 14.24 MB/s.  That boost is from using burst mode.
>
> Whop, this is great news. Btw. do you drive your flash in asynchronous
> or in synchronous mode? Do you have an extra flash configuration
> register that you need to modify in order to use the burst mode?
> My intel NOR flash has an extra read configuration register (RCR).
> However, for some reason I am not able to read/modify/read this register
> successfully.

In asynchronous mode.  I didn't have to program anything special in the flash
chip, just program the localbus controller to use page burst transfers.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2008-11-15  2:05 UTC | newest]

Thread overview: 19+ messages
2008-10-28 10:14 flash read performance Andre Puschmann
2008-10-29 11:42 ` Josh Boyer
2008-10-29 12:03   ` Jamie Lokier
2008-10-29 15:52   ` Andre Puschmann
2008-10-30  8:33     ` Arnaud Mouiche
2008-10-30  9:52       ` Andre Puschmann
2008-10-30 10:06         ` Arnaud Mouiche
2008-11-03 14:23           ` Andre Puschmann
2008-11-04  8:30             ` Andre Puschmann
2008-11-04 11:42             ` Jamie Lokier
2008-11-04 14:31               ` Andre Puschmann
2008-11-07  2:41                 ` Trent Piepho
2008-11-07  4:02                   ` Jamie Lokier
2008-11-07  5:36                     ` Trent Piepho
2008-11-07  5:57                       ` Jamie Lokier
2008-11-07  9:47                   ` Andre Puschmann
2008-11-08  5:28                     ` Trent Piepho
2008-11-11 13:28                       ` Andre Puschmann
2008-11-15  2:02                         ` Trent Piepho
