* flash read performance
From: Andre Puschmann @ 2008-10-28 10:14 UTC
To: linux-mtd

Hi list,

I am currently trying to improve the flash read performance of my
platform. It's a gumstix verdex board with a pxa270 running at 400MHz.
My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's
operating in the _normal_ asynchronous mode. In my opinion the read
performance is very poor, only around 1.2 to 1.4 MB/s depending on the
blocksize. I think it should be possible to get much higher transfer
rates.

In Linux, I ran my tests with dd like this (copy 10MB):

time dd if=/dev/mtd5 of=/dev/null bs=16k count=640
640+0 records in
640+0 records out
real    0m 7.17s
user    0m 0.00s
sys     0m 7.17s

Running top in another console shows that the CPU load is very high
during the copy. I am not sure if the system is doing some sort of busy
waiting? In any case, it should be possible to do the copy without such
a high load.

Mem: 17684K used, 45144K free, 0K shrd, 0K buff, 11164K cached
CPU:   0% usr 100% sys   0% nice   0% idle   0% io   0% irq   0% softirq
Load average: 0.10 0.17 0.09
  PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
  259   258 root     R     1100   2%  95% dd if /dev/mtd5 of /dev/null bs 16k co

I guess a number of people are using a similar/comparable setup, so some
kind of user benchmark would be nice. I am sure this needs some more
investigation, but any comment/hint is more than welcome.

Best regards,
Andre
* Re: flash read performance
From: Josh Boyer @ 2008-10-29 11:42 UTC
To: Andre Puschmann; +Cc: linux-mtd

On Tue, Oct 28, 2008 at 11:14:05AM +0100, Andre Puschmann wrote:
>Hi list,
>
>I am currently trying to improve the flash read performance of my
>platform. It's a gumstix verdex board with a pxa270 running at 400MHz.
>My flash is a 16MB NOR Intel StrataFlash P30 (128P30T) and it's
>operating in the _normal_ asynchronous mode. In my opinion the read
>performance is very poor, only around 1.2 to 1.4 MB/s depending on the
>blocksize. I think it should be possible to get much higher transfer rates.

Why do you think that?

>In Linux, I ran my tests with dd like this (copy 10MB):
>time dd if=/dev/mtd5 of=/dev/null bs=16k count=640
>640+0 records in
>640+0 records out
>real 0m 7.17s
>user 0m 0.00s
>sys 0m 7.17s
>
>Running top in another console shows that the CPU load is very high
>during the copy. I am not sure if the system is doing some sort of busy
>waiting? However, it should be possible to do a copy without having
>such a high load.

Why do you think that? The chip drivers don't do DMA, so all I/O goes
through the CPU.

josh
* Re: flash read performance
From: Jamie Lokier @ 2008-10-29 12:03 UTC
To: Josh Boyer; +Cc: Andre Puschmann, linux-mtd

Josh Boyer wrote:
> On Tue, Oct 28, 2008 at 11:14:05AM +0100, Andre Puschmann wrote:
> >In my opinion the read
> >performance is very poor, only around 1.2 to 1.4 MB/s depending on the
> >blocksize. I think it should be possible to get much higher transfer rates.
>
> Why do you think that?

Take a look at:

http://marc.info/?l=linux-embedded&m=122125638419881&w=2
http://marc.info/?l=linux-embedded&m=122149685932525&w=2
http://marc.info/?l=linux-embedded&m=122185227511786&w=2

-- Jamie
* Re: flash read performance
From: Andre Puschmann @ 2008-10-29 15:52 UTC
To: linux-mtd

Josh Boyer schrieb:
>> In my opinion the read
>> performance is very poor, only around 1.2 to 1.4 MB/s depending on the
>> blocksize. I think it should be possible to get much higher transfer rates.
>
> Why do you think that?

I guess there is something wrong with the timing parameters and/or the
way the CPU core speaks to the flash controller, which results in long
wait-states. But at least to my understanding, these transfer rates have
nothing to do with _high speed NOR flashes_ :-)

> Why do you think that? The chip drivers don't do DMA, so all I/O goes
> through the CPU.

Yes, DMA is not used. However, the CPU should be strong enough to do
this transfer faster. On the other hand, my understanding is that DMA
doesn't bring a speed improvement in every case: it offloads memory
transfers from the CPU rather than making them faster, and in this case
copying the data is the only task anyway.

Regards,
Andre
* Re: flash read performance
From: Arnaud Mouiche @ 2008-10-30 8:33 UTC
To: linux-mtd

Hi,

I ran into the same question in the past: bootloader NOR access was
really much faster than the Linux one. Yes, no DMA was used (but the
same in the bootloader, and anyway that doesn't impact the data rate,
only the CPU load). But even worse, the Linux code was using
memcpy_fromio, which is a basic byte-by-byte copy loop in the default
ARM implementation.

Maybe your issues are the same...

Regards,
arnaud

Andre Puschmann a écrit :
> Josh Boyer schrieb:
>>> In my opinion the read
>>> performance is very poor, only around 1.2 to 1.4 MB/s depending on the
>>> blocksize. I think it should be possible to get much higher transfer rates.
>>
>> Why do you think that?
>
> I guess there is something wrong with the timing parameters and/or the
> way the CPU core speaks to the flash controller, which results in long
> wait-states.
> But at least to my understanding, these transfer rates have nothing to
> do with _high speed NOR flashes_ :-)
>
>> Why do you think that? The chip drivers don't do DMA, so all I/O goes
>> through the CPU.
>
> Yes, DMA is not used. However, the CPU should be strong enough to do
> this transfer faster.
>
> Regards,
> Andre
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
* Re: flash read performance
From: Andre Puschmann @ 2008-10-30 9:52 UTC
To: linux-mtd

Hi,

> I ran into the same question in the past: bootloader NOR access was
> really much faster than the Linux one.

About how much faster? It really depends on the access method. I am
using u-boot, and if I use the basic cp.b routine it's about the same
_slow_ speed. I tried to use the asm-optimised memcpy routine that the
kernel has. This is much faster, around 5MB/s.

> Yes, no DMA was used (but the same in the bootloader, and anyway that
> doesn't impact the data rate, only the CPU load). But even worse, the
> Linux code was using memcpy_fromio, which is a basic byte-by-byte copy
> loop in the default ARM implementation.

Yes, memcpy_fromio is quite slow. But using a normal memcpy is not
suggested; one should only use writel()/readl() and memcpy_[from|to]io().

I am not sure about the right _fast_ way to do such copies.

Regards,
Andre
* Re: flash read performance
From: Arnaud Mouiche @ 2008-10-30 10:06 UTC
To: Andre Puschmann; +Cc: linux-mtd

I was using redboot, configured to use the optimized memcpy (yes, it
gives you the choice at configuration time).

On the kernel side, I just hacked memcpy_fromio to add a "weak"
attribute, and rewrote it to directly use the optimized Linux memcpy
(shame on me for this "not suggested" method, but speed was my goal).

After that, performance is equal between the bootloader and Linux, and
really close to what a DMA access reaches, which is also the
performance we can calculate from the flash access-timing configuration.

arnaud

Andre Puschmann a écrit :
> Hi,
>
>> I ran into the same question in the past: bootloader NOR access was
>> really much faster than the Linux one.
>
> About how much faster? It really depends on the access method. I am
> using u-boot, and if I use the basic cp.b routine it's about the same
> _slow_ speed. I tried to use the asm-optimised memcpy routine that the
> kernel has. This is much faster, around 5MB/s.
>
>> Yes, no DMA was used (but the same in the bootloader, and anyway that
>> doesn't impact the data rate, only the CPU load). But even worse, the
>> Linux code was using memcpy_fromio, which is a basic byte-by-byte copy
>> loop in the default ARM implementation.
>
> Yes, memcpy_fromio is quite slow. But using a normal memcpy is not
> suggested; one should only use writel()/readl() and memcpy_[from|to]io().
>
> I am not sure about the right _fast_ way to do such copies.
>
> Regards,
> Andre
* Re: flash read performance
From: Andre Puschmann @ 2008-11-03 14:23 UTC
To: linux-mtd

Hi,

I spent some more time on this issue and investigated some mtd-maps
drivers. The kernel I am using is a 2.6.21 that comes out of the gumstix
svn-repo. Unfortunately, it uses a legacy driver which only does an
ioremap() but no ioremap_nocache(). Patching the driver with this
additional call boosts transfers up to around 5.5MB/s, which is a fair
improvement. I will send a patch to the gumstix list. Users of newer
kernels might not need this, as they use a newer driver (pxa2xx-flash.c)
anyway.

But I am wondering if things can still go faster?! Jamie, do you have
some information about the speed I can expect theoretically? Or do I
have to switch over to another operation mode (i.e. synchronous) for
higher speeds?

Thanks in advance.

Best regards,
Andre

Arnaud Mouiche schrieb:
> I was using redboot, configured to use the optimized memcpy (yes, it
> gives you the choice at configuration time).
>
> On the kernel side, I just hacked memcpy_fromio to add a "weak"
> attribute, and rewrote it to directly use the optimized Linux memcpy
> (shame on me for this "not suggested" method, but speed was my goal).
>
> After that, performance is equal between the bootloader and Linux, and
> really close to what a DMA access reaches, which is also the
> performance we can calculate from the flash access-timing configuration.
>
> arnaud
* Re: flash read performance
From: Andre Puschmann @ 2008-11-04 8:30 UTC
To: linux-mtd

Hi,

Andre Puschmann schrieb:
> Unfortunately, it uses a legacy driver which only does a
> ioremap() but no ioremap_nocache().

Sorry, there was a typo in this statement. It should be
ioremap_cached().

Regards,
Andre
* Re: flash read performance
From: Jamie Lokier @ 2008-11-04 11:42 UTC
To: Andre Puschmann; +Cc: Arnaud Mouiche, linux-mtd

Andre Puschmann wrote:
> Hi,
>
> I spent some more time on this issue and investigated some mtd-maps
> drivers.
> The kernel I am using is a 2.6.21 that comes out of the gumstix
> svn-repo. Unfortunately, it uses a legacy driver which only does an
> ioremap() but no ioremap_nocache(). Patching the driver with this
> additional call boosts transfers up to around 5.5MB/s, which is a fair
> improvement.
> I will send a patch to the gumstix list. Users of newer kernels might
> not need this, as they use a newer driver (pxa2xx-flash.c) anyway.

I don't know much about this area, but will _writing_ to the flash
work reliably if ioremap_cached() is used?

-- Jamie
* Re: flash read performance
From: Andre Puschman @ 2008-11-04 14:31 UTC
To: Jamie Lokier; +Cc: Arnaud Mouiche, linux-mtd

Jamie Lokier schrieb:
> I don't know much about this area, but will _writing_ to the flash
> work reliably if ioremap_cached() is used?
>
> -- Jamie

Good point. I was only interested in reading, so I totally forgot about
writing ;-) I gave it a try, and although it was terribly slow (only a
few kB/s), it worked. I just did a cp uImage /dev/mtd3. On the other
hand, I never tried writing with the old driver, so I don't know if
that is faster.

I also did some more testing with my improved flash-timing parameters,
which yields read speeds of up to 18-19MB/s. That is really fast
compared to the 1.3MB/s at the beginning :-)

So for now, this is my result:
- cache and well-chosen flash timings have a great impact on (at least)
  read performance

But I think transfer rates > 20MB/s are possible...
Anyway, with these results, booting the complete system in nearly (or
even less than) 2s should be possible.

Regards,
Andre
* Re: flash read performance
From: Trent Piepho @ 2008-11-07 2:41 UTC
To: Andre Puschman; +Cc: MTD mailing list, Arnaud Mouiche, Jamie Lokier

On Tue, 4 Nov 2008, Andre Puschman wrote:
> Jamie Lokier schrieb:
>> I don't know much about this area, but will _writing_ to the flash
>> work reliably if ioremap_cached() is used?
>
> Good point. I was only interested in reading, so I totally forgot about
> writing ;-) I gave it a try, and although it was terribly slow (only a
> few kB/s), it worked. I just did a cp uImage /dev/mtd3. On the other
> hand, I never tried writing with the old driver, so I don't know if
> that is faster.

I've found that writes do not work with caching enabled. When the CPU
writes to the flash and then reads it back, it gets returned what it
wrote. That's not what is supposed to happen. For example, to program
flash word 'i' to the value 'val' using the Spansion/AMD method, you do
this:

flash[0] = 0xf0f0;
flash[0x555] = 0xaaaa;
flash[0x2aa] = 0x5555;
flash[0x555] = 0xa0a0;
flash[i] = val;
while (flash[i] != val); /* wait for it to finish */

After this flash[0] should be whatever data was there before, not
0xf0f0. Same with flash[0x555] and the rest. Only flash[i] should be
modified. But if the flash is cached, the CPU will use the cached
values and think flash[0] is 0xf0f0 until the cache gets flushed.

> I also did some more testing with my improved flash-timing parameters,
> which yields read speeds of up to 18-19MB/s. That is really fast
> compared to the 1.3MB/s at the beginning :-)

My results, from a mpc8572 (powerpc) with a Spansion s96gl064n flash
chip on a 100 MHz bus:

Mapping                     Speed (MB/sec)   (MB = 1048576 bytes)
un-cached and guarded       12.30
cached and guarded          14.24
cached and un-guarded       14.31
un-cached and un-guarded    14.66

I measured by reading the flash linearly from beginning to end, 32 bits
at a time. Since the flash is bigger than the cache, every read should
have come from the flash. If I just read the same 1k over and over,
that would obviously be much faster since it could come from the cache.

I'm just using the GPCM mode of the Freescale eLBC, which means I have
to use the same timings for both writes and reads. There are parts of
the timing I could make faster for reads, but then they would be too
short for writes, and vice versa. It also means I can't use the page
burst mode, which would speed up reads significantly.

> Anyway, with these results, booting the complete system in nearly (or
> even less than) 2s should be possible.

The biggest bootup delay I have now is waiting for the ethernet phy to
get online, which takes almost 3 seconds.
* Re: flash read performance
From: Jamie Lokier @ 2008-11-07 4:02 UTC
To: Trent Piepho; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list

Trent Piepho wrote:
> I've found that writes do not work with caching enabled. When the CPU
> writes to the flash and then reads it back, it gets returned what it
> wrote.

Thanks, both.

Based on Andre's observation, I will soon try enabling cache for my
NOR, and see if it makes a difference to cold-cache read performance.
I don't expect it, but it's worth a try. If that helps significantly,
then I'll look at doing writes properly.

> That's not what is supposed to happen. For example, to program
> flash word 'i' to the value 'val' using the Spansion/AMD method, you
> do this:
>
> flash[0] = 0xf0f0;
> flash[0x555] = 0xaaaa;
> flash[0x2aa] = 0x5555;
> flash[0x555] = 0xa0a0;
> flash[i] = val;
> while (flash[i] != val); /* wait for it to finish */
>
> After this flash[0] should be whatever data was there before, not
> 0xf0f0. Same with flash[0x555] and the rest. Only flash[i] should be
> modified. But if the flash is cached, the CPU will use the cached
> values and think flash[0] is 0xf0f0 until the cache gets flushed.

You might also find the write operation to be unreliable if the caching
mode is write-back rather than write-through.

Really, you should use an uncached mapping to write commands to the
flash, flush the cached mapping (for reads) when commands are written,
and prevent any access during the writes (this is in MTD normally).
You could optimise by flushing only the cached read regions which are
affected by write and erase commands.

>> I also did some more testing with my improved flash-timing parameters,
>> which yields read speeds of up to 18-19MB/s. That is really fast
>> compared to the 1.3MB/s at the beginning :-)
>
> My results, from a mpc8572 (powerpc) with a Spansion s96gl064n flash
> chip on a 100 MHz bus:
>
> Mapping                     Speed (MB/sec)   (MB = 1048576 bytes)
> un-cached and guarded       12.30
> cached and guarded          14.24
> cached and un-guarded       14.31
> un-cached and un-guarded    14.66
>
> I measured by reading the flash linearly from beginning to end, 32 bits
> at a time. Since the flash is bigger than the cache, every read should
> have come from the flash. If I just read the same 1k over and over,
> that would obviously be much faster since it could come from the cache.

That's nice to see that cache helps cold-read performance too, not
just cached reads. Thanks :-)

> The biggest bootup delay I have now is waiting for the ethernet phy to
> get online, which takes almost 3 seconds.

Do you need to delay everything else for that, or can you parallelise?

-- Jamie
* Re: flash read performance
From: Trent Piepho @ 2008-11-07 5:36 UTC
To: Jamie Lokier; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list

On Fri, 7 Nov 2008, Jamie Lokier wrote:
> Based on Andre's observation, I will soon try enabling cache for my
> NOR, and see if it makes a difference to cold-cache read performance.
> I don't expect it, but it's worth a try.

It's possible it could help by doing the reads back-to-back efficiently,
with no wasted cycles between them. I think it's necessary if you want
to benefit from page mode, but that's something I haven't tried yet.

> You might also find the write operation to be unreliable, if the
> caching mode is write-back rather than write-through.

Proper use of "sync" instructions, or whatever the arch uses to ensure
that writel() is strictly ordered, should fix that.

> Really, you should use an uncached mapping to write commands to the
> flash, flush the cached mapping (for reads) when commands are written,
> and prevent any access during the writes (this is in MTD normally).
> You could optimise by flushing only the cached read regions which are
> affected by write and erase commands.

Yes, that is probably the best. Most NOR flash writing is so slow that
the cache flushes shouldn't be too expensive.

>> Mapping                     Speed (MB/sec)   (MB = 1048576 bytes)
>> un-cached and guarded       12.30
>> cached and guarded          14.24
>> cached and un-guarded       14.31
>> un-cached and un-guarded    14.66
>
> That's nice to see that cache helps cold-read performance too, not
> just cached reads. Thanks :-)

Though if the mapping is not in guarded mode, turning the cache on hurts
performance. That surprises me.

>> The biggest bootup delay I have now is waiting for the ethernet phy to
>> get online, which takes almost 3 seconds.
>
> Do you need to delay everything else for that, or can you parallelise?

It's parallel. I start the phy very early in the boot loader, and Linux
is done booting and sitting in userspace waiting for it for a few
hundred ms before it's ready.
* Re: flash read performance
From: Jamie Lokier @ 2008-11-07 5:57 UTC
To: Trent Piepho; +Cc: Arnaud Mouiche, Andre Puschman, MTD mailing list

Trent Piepho wrote:
> On Fri, 7 Nov 2008, Jamie Lokier wrote:
>> You might also find the write operation to be unreliable, if the
>> caching mode is write-back rather than write-through.
>
> Proper use of "sync" instructions, or whatever the arch uses to ensure
> that writel() is strictly ordered, should fix that.

I think that won't work, because strongly ordered writes don't
translate directly to bus transactions when the region is mapped with
ioremap_cached(). The flash wants exact bus transactions.

-- Jamie
* Re: flash read performance
From: Andre Puschmann @ 2008-11-07 9:47 UTC
To: linux-mtd

Hi,

Trent Piepho wrote:
> I've found that writes do not work with caching enabled. When the CPU
> writes to the flash and then reads it back, it gets returned what it
> wrote. That's not what is supposed to happen.

Sure, writes should be uncached and unbuffered. But I thought the mtd
layer handles this correctly, as there are two different ioremaps in
the driver:

map.virt = ioremap(..);
map.cached = ioremap_cached(..);
map.inval_cache = inval_cache_fct();

So, calling inval_cache_fct() just before any write operation and then
using the uncached mapping should do the trick, no? On the other hand,
I am not sure if the mtd layer really behaves like that. Can somebody
confirm this?

Anyway, I tried to write a cramfs image to a previously (in u-boot)
erased flash area. After that, I could successfully boot the system
using this cramfs as my root. So I would conclude that writes are OK.
Time is another point (cramfs_xip.bin is 1.6MB):

# time cp cramfs_xip.bin /dev/mtd3
real    4m 13.52s
user    0m 0.00s
sys     4m 13.03s

This is around 6.3kB/s. Doing the same write in u-boot with cp.b takes
about 26sec, which is around 62kB/s.

> I measured by reading the flash linearly from beginning to end, 32 bits
> at a time. Since the flash is bigger than the cache, every read should
> have come from the flash. If I just read the same 1k over and over,
> that would obviously be much faster since it could come from the cache.

I measured by reading from the mtd char device, which is not reading
the same data over and over again. I copied the whole partition into a
ramdisk. mtd3 is 3MB in size, so this yields a read speed of 11.53MB/s:

# time cp /dev/mtd3 /tmp/test
real    0m 0.26s
user    0m 0.01s
sys     0m 0.24s

> I'm just using the GPCM mode of the Freescale eLBC, which means I have
> to use the same timings for both writes and reads. There are parts of
> the timing I could make faster for reads, but then they would be too
> short for writes, and vice versa. It also means I can't use the page
> burst mode, which would speed up reads significantly.

I am not familiar with GPCM and the eLBC, but it sounds about the same
here: same timings for writes and reads. But I use burst mode (4 words),
which only applies to reads.

> The biggest bootup delay I have now is waiting for the ethernet phy to
> get online, which takes almost 3 seconds.

Same here, the ethernet phy takes sooo long. Here is what I do: I am
using a parallel init. One script is just for loading the ethernet
module, which runs next to the other scripts. So the 2sec is a bit
cheated, because ethernet isn't really available at that point in time.

Regards,
Andre
* Re: flash read performance
From: Trent Piepho @ 2008-11-08 5:28 UTC
To: Andre Puschmann; +Cc: Jamie Lokier, MTD mailing list

On Fri, 7 Nov 2008, Andre Puschmann wrote:
>> I've found that writes do not work with caching enabled. When the CPU
>> writes to the flash and then reads it back, it gets returned what it
>> wrote. That's not what is supposed to happen.
>
> Sure, writes should be uncached and unbuffered. But I thought the mtd
> layer handles this correctly, as there are two different ioremaps in
> the driver:
>
> map.virt = ioremap(..);
> map.cached = ioremap_cached(..);
> map.inval_cache = inval_cache_fct();

It depends on what mapping driver you're using. It looks like only the
pxa2xx driver uses map.cached. The physmap or of_physmap drivers that
I'm using don't use it.

> So, calling inval_cache_fct() just before any write operation and then
> using the uncached mapping should do the trick, no? On the other hand,
> I am not sure if the mtd layer really behaves like that. Can somebody
> confirm this?

From looking at the code, I'd say you're right.

> I am not familiar with GPCM and the eLBC, but it sounds about the same
> here: same timings for writes and reads. But I use burst mode (4 words),
> which only applies to reads.

I've switched from GPCM to UPM, which lets me use different timings for
read and write as well as use burst mode.

In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s. That's
just from slightly better timings, because I could make them different
for reads vs writes. The big difference is cached and non-guarded reads,
which went from 14.24 MB/s to 44.79 MB/s. That boost is from using burst
mode.

So the answer is yes, turning on the cache can boost cold-cache
performance if doing so lets you use page burst mode. It makes a huge
difference, in fact!

> Same here, the ethernet phy takes sooo long. Here is what I do: I am
> using a parallel init. One script is just for loading the ethernet
> module, which runs next to the other scripts. So the 2sec is a bit
> cheated, because ethernet isn't really available at that point in time.

It might be the case that the ethernet module resets the PHY when it
loads and/or when the ethernet device is opened. That was a problem I
was having. The PHY would almost be done when the dhcp client would run
and open eth0, which would restart the phy all over again.
* Re: flash read performance
From: Andre Puschmann @ 2008-11-11 13:28 UTC
To: linux-mtd

Hi Trent,

Trent Piepho schrieb:
>> map.virt = ioremap(..);
>> map.cached = ioremap_cached(..);
>> map.inval_cache = inval_cache_fct();
>
> It depends on what mapping driver you're using. It looks like only the
> pxa2xx driver uses map.cached. The physmap or of_physmap drivers that
> I'm using don't use it.

The flashmap driver I am using is a custom one made for the gumstix
board. However, it is almost identical to the pxa2xx driver from newer
kernels.

> I've switched from GPCM to UPM, which lets me use different timings for
> read and write as well as use burst mode.
>
> In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s. That's
> just from slightly better timings, because I could make them different
> for reads vs writes. The big difference is cached and non-guarded reads,
> which went from 14.24 MB/s to 44.79 MB/s. That boost is from using burst
> mode.

Wow, this is great news. Btw, do you drive your flash in asynchronous
or in synchronous mode? Do you have an extra flash configuration
register that you need to modify in order to use the burst mode? My
Intel NOR flash has an extra read configuration register (RCR).
However, for some reason I am not able to read/modify/write this
register successfully.

> It might be the case that the ethernet module resets the PHY when it
> loads and/or when the ethernet device is opened. That was a problem I
> was having. The PHY would almost be done when the dhcp client would run
> and open eth0, which would restart the phy all over again.

Thanks for that hint. I'll downgrade this to a later task.

Regards,
Andre
* Re: flash read performance
  2008-11-11 13:28 ` Andre Puschmann
@ 2008-11-15  2:02   ` Trent Piepho
  0 siblings, 0 replies; 19+ messages in thread
From: Trent Piepho @ 2008-11-15 2:02 UTC (permalink / raw)
To: Andre Puschmann; +Cc: linux-mtd

On Tue, 11 Nov 2008, Andre Puschmann wrote:
> Trent Piepho schrieb:
>>> map.virt = ioremap(..);
>>> map.cached = ioremap_cached(..);
>>> map.inval_cache = inval_cache_fct();
>>
>> It depends on what mapping driver you're using. It looks like only the
>> pxa2xx driver uses map.cached. The physmap or of_physmap drivers that
>> I'm using don't use it.
>
> The flash map driver I am using is a custom one made for the gumstix
> board. However, it is almost identical to the pxa2xx driver from newer
> kernels.

I added support to physmap_of for using map.cached on ppc32; it seems
to be working so far. But it turns out it doesn't work for XIP. The
flash drivers that support XIP implement a ->point() method that
something like AXFS or cramfs+xip uses to mmap the flash. But these
just return pointers to the uncached mapping.

>> In non-cached and guarded mode, I now get 13.61 vs 12.30 MB/s. That's
>> just from slightly better timings because I could make them different
>> for read vs write. The big difference is cached and non-guarded reads,
>> which went to 44.79 MB/s from 14.24 MB/s. That boost is from using
>> burst mode.
>
> Whoa, this is great news. Btw, do you drive your flash in asynchronous
> or in synchronous mode? Do you have an extra flash configuration
> register that you need to modify in order to use burst mode? My Intel
> NOR flash has an extra read configuration register (RCR). However, for
> some reason I am not able to read/modify/write this register
> successfully.

In asynchronous mode. I didn't have to program anything special into
the flash chip, just program the localbus controller to use page burst
transfers.

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2008-11-15  2:05 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2008-10-28 10:14 flash read performance Andre Puschmann
2008-10-29 11:42 ` Josh Boyer
2008-10-29 12:03 ` Jamie Lokier
2008-10-29 15:52 ` Andre Puschmann
2008-10-30  8:33 ` Arnaud Mouiche
2008-10-30  9:52 ` Andre Puschmann
2008-10-30 10:06 ` Arnaud Mouiche
2008-11-03 14:23 ` Andre Puschmann
2008-11-04  8:30 ` Andre Puschmann
2008-11-04 11:42 ` Jamie Lokier
2008-11-04 14:31 ` Andre Puschmann
2008-11-07  2:41 ` Trent Piepho
2008-11-07  4:02 ` Jamie Lokier
2008-11-07  5:36 ` Trent Piepho
2008-11-07  5:57 ` Jamie Lokier
2008-11-07  9:47 ` Andre Puschmann
2008-11-08  5:28 ` Trent Piepho
2008-11-11 13:28 ` Andre Puschmann
2008-11-15  2:02 ` Trent Piepho