* [U-Boot] i.MX51: FEC: Cache coherency problem?
@ 2011-07-18 15:18 David Jander
2011-07-18 16:16 ` Aneesh V
2011-07-18 16:55 ` Stefano Babic
0 siblings, 2 replies; 36+ messages in thread
From: David Jander @ 2011-07-18 15:18 UTC (permalink / raw)
To: u-boot
Hi all,
I am busy debugging a problem with the i.MX51 FEC ethernet driver, that
stopped working after upgrading u-boot. Before the upgrade I used
v2010.06-rc3, which worked fine.
I gave up on trying to find the difference beween the old version and this one
that broke it.... due to BSP issues, git bisecting seems a monumental task I
am not considering yet.
The funny part is that it seems to work fine if I disable data-caches!
With data caches enabled, it hangs in the following while loop in fec_send(),
at line 592:
...
584 /*
585 * Enable SmartDMA transmit task
586 */
587 fec_tx_task_enable(fec);
588
589 /*
590 * wait until frame is sent .
591 */
592 while (readw(&fec->tbd_base[fec->tbd_index].status) & FEC_TBD_READY) {
593 udelay(1);
594 }
...
If I change this code in the following way, the while loop exits successfully:
...
584 /*
585 * Enable SmartDMA transmit task
586 */
587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
588 fec_tx_task_enable(fec);
589 flash_dcache_all();
590
591 /*
592 * wait until frame is sent .
593 */
594 while (readw(&fec->tbd_base[fec->tbd_index].status) & FEC_TBD_READY) {
595 udelay(1);
596 }
...
Notice the cache flush calls at line 587 and 589. With these, sending
succeeds. The driver still hangs in receiving afterwards, but at least this
part seems to work. If I remove either of the two added lines, it stops
working again.
What is going on here? Why did this work with caches enabled before??
Best regards,
--
David Jander
Protonic Holland.
^ permalink raw reply [flat|nested] 36+ messages in thread* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-18 15:18 [U-Boot] i.MX51: FEC: Cache coherency problem? David Jander @ 2011-07-18 16:16 ` Aneesh V 2011-07-19 7:26 ` David Jander 2011-07-19 11:07 ` Matthias Weißer 2011-07-18 16:55 ` Stefano Babic 1 sibling, 2 replies; 36+ messages in thread From: Aneesh V @ 2011-07-18 16:16 UTC (permalink / raw) To: u-boot Hi David, On Monday 18 July 2011 08:48 PM, David Jander wrote: > > Hi all, > > I am busy debugging a problem with the i.MX51 FEC ethernet driver, that > stopped working after upgrading u-boot. Before the upgrade I used > v2010.06-rc3, which worked fine. > I gave up on trying to find the difference beween the old version and this one > that broke it.... due to BSP issues, git bisecting seems a monumental task I > am not considering yet. > The funny part is that it seems to work fine if I disable data-caches! > With data caches enabled, it hangs in the following while loop in fec_send(), > at line 592: > > ... > 584 /* > 585 * Enable SmartDMA transmit task > 586 */ > 587 fec_tx_task_enable(fec); > 588 > 589 /* > 590 * wait until frame is sent . > 591 */ > 592 while (readw(&fec->tbd_base[fec->tbd_index].status)& FEC_TBD_READY) { > 593 udelay(1); > 594 } > ... > > If I change this code in the following way, the while loop exits successfully: > > ... > 584 /* > 585 * Enable SmartDMA transmit task > 586 */ > 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); > 588 fec_tx_task_enable(fec); > 589 flash_dcache_all(); > 590 > 591 /* > 592 * wait until frame is sent . > 593 */ > 594 while (readw(&fec->tbd_base[fec->tbd_index].status)& FEC_TBD_READY) { > 595 udelay(1); > 596 } > ... > > Notice the cache flush calls at line 587 and 589. With these, sending > succeeds. The driver still hangs in receiving afterwards, but at least this > part seems to work. If I remove either of the two added lines, it stops > working again. > > What is going on here? Why did this work with caches enabled before?? Are you sure caches were enabled before? Until recently absence of CONFIG_SYS_DCACHE_OFF did not mean that your D-cache is enabled because you also had to call the function dcache_enable(). Some platforms were calling this from board file or some drivers but not all. The following patch changed this for ARM: commit c2dd0d45540397704de9b13287417d21049d34c6 armv7: integrate cache maintenance support In this patch I added a call to dcache_enable() at the beginning of board_init_r() for ARM(i.e. as soon as relocation is over). As a result D-cache will be enabled for all ARM platforms now unless CONFIG_SYS_DCACHE_OFF is set. Looks like this is a real coherency issue that is brought out because d-cache is really enabled for you now. best regards, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-18 16:16 ` Aneesh V @ 2011-07-19 7:26 ` David Jander 2011-07-19 11:07 ` Matthias Weißer 1 sibling, 0 replies; 36+ messages in thread From: David Jander @ 2011-07-19 7:26 UTC (permalink / raw) To: u-boot On Mon, 18 Jul 2011 21:46:28 +0530 Aneesh V <aneesh@ti.com> wrote: > Hi David, > > On Monday 18 July 2011 08:48 PM, David Jander wrote: > > > > Hi all, > > > > I am busy debugging a problem with the i.MX51 FEC ethernet driver, that > > stopped working after upgrading u-boot. Before the upgrade I used > > v2010.06-rc3, which worked fine. > > I gave up on trying to find the difference beween the old version and this > > one that broke it.... due to BSP issues, git bisecting seems a monumental > > task I am not considering yet. > > The funny part is that it seems to work fine if I disable data-caches! > > With data caches enabled, it hangs in the following while loop in > > fec_send(), at line 592: > > > > ... > > 584 /* > > 585 * Enable SmartDMA transmit task > > 586 */ > > 587 fec_tx_task_enable(fec); > > 588 > > 589 /* > > 590 * wait until frame is sent . > > 591 */ > > 592 while (readw(&fec->tbd_base[fec->tbd_index].status)& > > FEC_TBD_READY) { 593 udelay(1); > > 594 } > > ... > > > > If I change this code in the following way, the while loop exits > > successfully: > > > > ... > > 584 /* > > 585 * Enable SmartDMA transmit task > > 586 */ > > 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); > > 588 fec_tx_task_enable(fec); > > 589 flash_dcache_all(); > > 590 > > 591 /* > > 592 * wait until frame is sent . > > 593 */ > > 594 while (readw(&fec->tbd_base[fec->tbd_index].status)& > > FEC_TBD_READY) { 595 udelay(1); > > 596 } > > ... > > > > Notice the cache flush calls at line 587 and 589. With these, sending > > succeeds. The driver still hangs in receiving afterwards, but at least this > > part seems to work. If I remove either of the two added lines, it stops > > working again. > > > > What is going on here? Why did this work with caches enabled before?? > > Are you sure caches were enabled before? Until recently absence of > CONFIG_SYS_DCACHE_OFF did not mean that your D-cache is enabled because > you also had to call the function dcache_enable(). Some platforms were > calling this from board file or some drivers but not all. We did. I am always calling dcache_enable(), icache_enable() and l2_cache_enable() from board code. The latter function I implemented myself, because nor u-boot nor the kernel does otherwise activate l2-cache, which has a big performance impact. Just in case, at the moment I am testing with only the L1 caches enabled. > The following patch changed this for ARM: > > commit c2dd0d45540397704de9b13287417d21049d34c6 > armv7: integrate cache maintenance support > > In this patch I added a call to dcache_enable() at the beginning of > board_init_r() for ARM(i.e. as soon as relocation is over). As a result > D-cache will be enabled for all ARM platforms now unless > CONFIG_SYS_DCACHE_OFF is set. Yes, I noticed that. > Looks like this is a real coherency issue that is brought out because > d-cache is really enabled for you now. AFAICS, it was always enabled for our board. At least I did always call i/dcache_enable() in board setup code.... was it broken before? Are the L1 caches enabled by (mainline) linux code? Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-18 16:16 ` Aneesh V 2011-07-19 7:26 ` David Jander @ 2011-07-19 11:07 ` Matthias Weißer 2011-07-19 11:17 ` David Jander ` (2 more replies) 1 sibling, 3 replies; 36+ messages in thread From: Matthias Weißer @ 2011-07-19 11:07 UTC (permalink / raw) To: u-boot Dear Aneesh Am 18.07.2011 18:16, schrieb Aneesh V: > commit c2dd0d45540397704de9b13287417d21049d34c6 > armv7: integrate cache maintenance support > > In this patch I added a call to dcache_enable() at the beginning of > board_init_r() for ARM(i.e. as soon as relocation is over). As a result > D-cache will be enabled for all ARM platforms now unless > CONFIG_SYS_DCACHE_OFF is set. Is this really a good idea? This will break a couple of boards using non-cache-aware drivers. And there are a couple of them in u-boot. I think d-cache should be opt-in rather then opt-out as long as there are any drivers which didn't handle cached memory regions correct. i-cache is much less problematic and can be enabled by default. If d-cache will be enabled by default on ARM I think I have to send a patch for one of my boards :-) Regards Matthias ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 11:07 ` Matthias Weißer @ 2011-07-19 11:17 ` David Jander 2011-07-19 11:20 ` Wolfgang Denk 2011-07-19 11:19 ` Wolfgang Denk 2011-07-19 11:51 ` Aneesh V 2 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-19 11:17 UTC (permalink / raw) To: u-boot On Tue, 19 Jul 2011 13:07:52 +0200 Matthias Wei?er <weisserm@arcor.de> wrote: > Dear Aneesh > > Am 18.07.2011 18:16, schrieb Aneesh V: > > commit c2dd0d45540397704de9b13287417d21049d34c6 > > armv7: integrate cache maintenance support > > > > In this patch I added a call to dcache_enable() at the beginning of > > board_init_r() for ARM(i.e. as soon as relocation is over). As a result > > D-cache will be enabled for all ARM platforms now unless > > CONFIG_SYS_DCACHE_OFF is set. > > Is this really a good idea? This will break a couple of boards using > non-cache-aware drivers. And there are a couple of them in u-boot. I > think d-cache should be opt-in rather then opt-out as long as there are > any drivers which didn't handle cached memory regions correct. i-cache > is much less problematic and can be enabled by default. > > If d-cache will be enabled by default on ARM I think I have to send a > patch for one of my boards :-) Exactly the issue I am having now... the fact that I was unaware of this patch and generally don't have much experience with solving cache-related issues apparently, made me waste a lot of time debugging the fec_mxc.c driver while the driver itself wasn't really broken (at least not more than before). Now I finally know what's wrong and am working on a proposed fix to make this one driver cache-aware. Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 11:17 ` David Jander @ 2011-07-19 11:20 ` Wolfgang Denk 2011-07-19 12:10 ` David Jander 0 siblings, 1 reply; 36+ messages in thread From: Wolfgang Denk @ 2011-07-19 11:20 UTC (permalink / raw) To: u-boot Dear David Jander, In message <20110719131744.403a81e6@archvile> you wrote: > > Now I finally know what's wrong and am working on a proposed fix to make this > one driver cache-aware. I would just like to point out that these efforts are highly appreciated! Thanks! Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Chapter 1 -- The story so far: In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 11:20 ` Wolfgang Denk @ 2011-07-19 12:10 ` David Jander 2011-07-20 6:29 ` David Jander 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-19 12:10 UTC (permalink / raw) To: u-boot On Tue, 19 Jul 2011 13:20:26 +0200 Wolfgang Denk <wd@denx.de> wrote: > Dear David Jander, > > In message <20110719131744.403a81e6@archvile> you wrote: > > > > Now I finally know what's wrong and am working on a proposed fix to make > > this one driver cache-aware. > > I would just like to point out that these efforts are highly > appreciated! Thanks. I'll try my best, but I am running out of time, and it is still not as it should. I still have trouble identifying the right place where receive buffer descriptors should be cache-invalidated. At one point, a magic flush_dcache_all() at a certain place in code made it work, but that can't be entirely correct, and if I replace it with only the necessary ranges, it doesn't work correctly anymore, so I guess this flush is also invalidating something I need elsewhere :-( I keep searching.... through the IMO fairly convoluted network code. Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 12:10 ` David Jander @ 2011-07-20 6:29 ` David Jander 2011-07-20 8:56 ` Albert ARIBAUD 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-20 6:29 UTC (permalink / raw) To: u-boot On Tue, 19 Jul 2011 14:10:48 +0200 David Jander <david.jander@protonic.nl> wrote: > On Tue, 19 Jul 2011 13:20:26 +0200 > Wolfgang Denk <wd@denx.de> wrote: > > > Dear David Jander, > > > > In message <20110719131744.403a81e6@archvile> you wrote: > > > > > > Now I finally know what's wrong and am working on a proposed fix to make > > > this one driver cache-aware. > > > > I would just like to point out that these efforts are highly > > appreciated! > > Thanks. > I'll try my best, but I am running out of time, and it is still not as it > should. I still have trouble identifying the right place where receive buffer > descriptors should be cache-invalidated. At one point, a magic > flush_dcache_all() at a certain place in code made it work, but that can't be > entirely correct, and if I replace it with only the necessary ranges, it > doesn't work correctly anymore, so I guess this flush is also invalidating > something I need elsewhere :-( > I keep searching.... through the IMO fairly convoluted network code. Ok, that was not such a problem. I found few nice spots, where cache-maintenance code should go IMO: In fec_rx_task_enable() cleaned RX BD's should be flushed. In fec_tx_task_enable() dirty TX BD's should be flushed, as well as data buffers. In fec_send() Invalidate BD status fields before reading, also in the while loop. In fec_recv() Invalidate BD and data buffers just before every read. And of course align all buffers to cache-line size and pad around the end of each buffer to cache-line size. Problem: It doesn't work (tftp transfer stops after a few packets)! :-( I am giving up! Plan B: I'll now try to allocate all buffers in internal RAM (which is not cached AFAIK). Any ideas? Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-20 6:29 ` David Jander @ 2011-07-20 8:56 ` Albert ARIBAUD 2011-07-20 9:21 ` David Jander 0 siblings, 1 reply; 36+ messages in thread From: Albert ARIBAUD @ 2011-07-20 8:56 UTC (permalink / raw) To: u-boot Hi David, Le 20/07/2011 08:29, David Jander a ?crit : > On Tue, 19 Jul 2011 14:10:48 +0200 > David Jander<david.jander@protonic.nl> wrote: > >> On Tue, 19 Jul 2011 13:20:26 +0200 >> Wolfgang Denk<wd@denx.de> wrote: >> >>> Dear David Jander, >>> >>> In message<20110719131744.403a81e6@archvile> you wrote: >>>> >>>> Now I finally know what's wrong and am working on a proposed fix to make >>>> this one driver cache-aware. >>> >>> I would just like to point out that these efforts are highly >>> appreciated! >> >> Thanks. >> I'll try my best, but I am running out of time, and it is still not as it >> should. I still have trouble identifying the right place where receive buffer >> descriptors should be cache-invalidated. At one point, a magic >> flush_dcache_all() at a certain place in code made it work, but that can't be >> entirely correct, and if I replace it with only the necessary ranges, it >> doesn't work correctly anymore, so I guess this flush is also invalidating >> something I need elsewhere :-( >> I keep searching.... through the IMO fairly convoluted network code. > > Ok, that was not such a problem. I found few nice spots, where > cache-maintenance code should go IMO: > > In fec_rx_task_enable() cleaned RX BD's should be flushed. > In fec_tx_task_enable() dirty TX BD's should be flushed, as well as data > buffers. > In fec_send() Invalidate BD status fields before reading, also in the while > loop. > In fec_recv() Invalidate BD and data buffers just before every read. > > And of course align all buffers to cache-line size and pad around the end of > each buffer to cache-line size. > > Problem: It doesn't work (tftp transfer stops after a few packets)! :-( > I am giving up! > > Plan B: I'll now try to allocate all buffers in internal RAM (which is not > cached AFAIK). > > Any ideas? Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA as well, not because of cache (it was not active at the time) but because of instruction reordering done by the compiler when optimizing, which resulted in out-of-order accesses to the descritpors and DMA registered, so that the DMA start was written before the descriptors were set up. The (proper and clean) solution was to introduce memory barriers at strategic points of the driver to ensure that specific DMA starts were done only after all writes to the descriptors (and possibly buffers). See drivers/net/mvgbe.c, and llok for 'isb()' instructions. > Best regards, Amicalement, -- Albert. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-20 8:56 ` Albert ARIBAUD @ 2011-07-20 9:21 ` David Jander 2011-07-20 10:29 ` Aneesh V 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-20 9:21 UTC (permalink / raw) To: u-boot On Wed, 20 Jul 2011 10:56:07 +0200 Albert ARIBAUD <albert.u.boot@aribaud.net> wrote: > Hi David, > > Le 20/07/2011 08:29, David Jander a ?crit : > > On Tue, 19 Jul 2011 14:10:48 +0200 > > David Jander<david.jander@protonic.nl> wrote: > > > >> On Tue, 19 Jul 2011 13:20:26 +0200 > >> Wolfgang Denk<wd@denx.de> wrote: > >> > >>> Dear David Jander, > >>> > >>> In message<20110719131744.403a81e6@archvile> you wrote: > >>>> > >>>> Now I finally know what's wrong and am working on a proposed fix to make > >>>> this one driver cache-aware. > >>> > >>> I would just like to point out that these efforts are highly > >>> appreciated! > >> > >> Thanks. > >> I'll try my best, but I am running out of time, and it is still not as it > >> should. I still have trouble identifying the right place where receive > >> buffer descriptors should be cache-invalidated. At one point, a magic > >> flush_dcache_all() at a certain place in code made it work, but that > >> can't be entirely correct, and if I replace it with only the necessary > >> ranges, it doesn't work correctly anymore, so I guess this flush is also > >> invalidating something I need elsewhere :-( > >> I keep searching.... through the IMO fairly convoluted network code. > > > > Ok, that was not such a problem. I found few nice spots, where > > cache-maintenance code should go IMO: > > > > In fec_rx_task_enable() cleaned RX BD's should be flushed. > > In fec_tx_task_enable() dirty TX BD's should be flushed, as well as data > > buffers. > > In fec_send() Invalidate BD status fields before reading, also in the while > > loop. > > In fec_recv() Invalidate BD and data buffers just before every read. > > > > And of course align all buffers to cache-line size and pad around the end > > of each buffer to cache-line size. > > > > Problem: It doesn't work (tftp transfer stops after a few packets)! :-( > > I am giving up! > > > > Plan B: I'll now try to allocate all buffers in internal RAM (which is not > > cached AFAIK). > > > > Any ideas? > > Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA > as well, not because of cache (it was not active at the time) but > because of instruction reordering done by the compiler when optimizing, > which resulted in out-of-order accesses to the descritpors and DMA > registered, so that the DMA start was written before the descriptors > were set up. The (proper and clean) solution was to introduce memory > barriers at strategic points of the driver to ensure that specific DMA > starts were done only after all writes to the descriptors (and possibly > buffers). Hmmm. I know that issue, but AFAICS, the driver already uses readX() and writeX() macros when accessing register and DMA memory. Those macros have compiler barriers included.... and armv7 is still in-order execution ;-) Thanks for the suggestion anyway :-) Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-20 9:21 ` David Jander @ 2011-07-20 10:29 ` Aneesh V 2011-07-20 11:31 ` David Jander 0 siblings, 1 reply; 36+ messages in thread From: Aneesh V @ 2011-07-20 10:29 UTC (permalink / raw) To: u-boot Hi David, On Wednesday 20 July 2011 02:51 PM, David Jander wrote: > On Wed, 20 Jul 2011 10:56:07 +0200 [snip ..] >>> Any ideas? >> >> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA >> as well, not because of cache (it was not active at the time) but >> because of instruction reordering done by the compiler when optimizing, >> which resulted in out-of-order accesses to the descritpors and DMA >> registered, so that the DMA start was written before the descriptors >> were set up. The (proper and clean) solution was to introduce memory >> barriers at strategic points of the driver to ensure that specific DMA >> starts were done only after all writes to the descriptors (and possibly >> buffers). > > Hmmm. I know that issue, but AFAICS, the driver already uses readX() and > writeX() macros when accessing register and DMA memory. Those macros have > compiler barriers included.... and armv7 is still in-order execution ;-) 1. armv7 is not necessarily in-order. Cortex-A8 is in-order while Cortex-A9 is out-of-order. 2. I am not sure if in-order execution guarantees memory ordering. My understanding is that it doesn't. 3. In u-boot's implementation of MMU for ARM, the register space(everything other than SDRAM) is 'strongly-ordered' memory. Strongly ordered accesses will be correctly ordered w.r.to all other accesses in program order. 4. Compiler barriers prevent the compiler from doing certain optimizations, but doesn't help in these situations. In short, I think you don't need to worry about ordering, but not because of compiler barriers or armv7 being in-order. br, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-20 10:29 ` Aneesh V @ 2011-07-20 11:31 ` David Jander 2011-07-20 12:05 ` Aneesh V 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-20 11:31 UTC (permalink / raw) To: u-boot Hi Aneesh, On Wed, 20 Jul 2011 15:59:42 +0530 Aneesh V <aneesh@ti.com> wrote: > On Wednesday 20 July 2011 02:51 PM, David Jander wrote: > > On Wed, 20 Jul 2011 10:56:07 +0200 > > [snip ..] > > >>> Any ideas? > >> > >> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA > >> as well, not because of cache (it was not active at the time) but > >> because of instruction reordering done by the compiler when optimizing, > >> which resulted in out-of-order accesses to the descritpors and DMA > >> registered, so that the DMA start was written before the descriptors > >> were set up. The (proper and clean) solution was to introduce memory > >> barriers at strategic points of the driver to ensure that specific DMA > >> starts were done only after all writes to the descriptors (and possibly > >> buffers). > > > > Hmmm. I know that issue, but AFAICS, the driver already uses readX() and > > writeX() macros when accessing register and DMA memory. Those macros have > > compiler barriers included.... and armv7 is still in-order execution ;-) > > 1. armv7 is not necessarily in-order. Cortex-A8 is in-order while > Cortex-A9 is out-of-order. > 2. I am not sure if in-order execution guarantees memory ordering. My > understanding is that it doesn't. > 3. In u-boot's implementation of MMU for ARM, the register > space(everything other than SDRAM) is 'strongly-ordered' memory. > Strongly ordered accesses will be correctly ordered w.r.to all other > accesses in program order. Wow, I hadn't noticed dcache_enable() also enables the MMU. Cool. One step closer to splitting the mapping between a cached and uncached, strongly-ordered region of RAM. > 4. Compiler barriers prevent the compiler from doing certain > optimizations, but doesn't help in these situations. Thanks for this excellent summary! > In short, I think you don't need to worry about ordering, but not > because of compiler barriers or armv7 being in-order. Well, I do need to worry about the buffer-descriptors and buffers then, since they are in SDRAM, right? Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-20 11:31 ` David Jander @ 2011-07-20 12:05 ` Aneesh V 0 siblings, 0 replies; 36+ messages in thread From: Aneesh V @ 2011-07-20 12:05 UTC (permalink / raw) To: u-boot Hi David, On Wednesday 20 July 2011 05:01 PM, David Jander wrote: > > Hi Aneesh, > > On Wed, 20 Jul 2011 15:59:42 +0530 > Aneesh V<aneesh@ti.com> wrote: >> On Wednesday 20 July 2011 02:51 PM, David Jander wrote: >>> On Wed, 20 Jul 2011 10:56:07 +0200 >> >> [snip ..] >> >>>>> Any ideas? >>>> >>>> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA >>>> as well, not because of cache (it was not active at the time) but >>>> because of instruction reordering done by the compiler when optimizing, >>>> which resulted in out-of-order accesses to the descritpors and DMA >>>> registered, so that the DMA start was written before the descriptors >>>> were set up. The (proper and clean) solution was to introduce memory >>>> barriers at strategic points of the driver to ensure that specific DMA >>>> starts were done only after all writes to the descriptors (and possibly >>>> buffers). >>> >>> Hmmm. I know that issue, but AFAICS, the driver already uses readX() and >>> writeX() macros when accessing register and DMA memory. Those macros have >>> compiler barriers included.... and armv7 is still in-order execution ;-) >> >> 1. armv7 is not necessarily in-order. Cortex-A8 is in-order while >> Cortex-A9 is out-of-order. >> 2. I am not sure if in-order execution guarantees memory ordering. My >> understanding is that it doesn't. >> 3. In u-boot's implementation of MMU for ARM, the register >> space(everything other than SDRAM) is 'strongly-ordered' memory. >> Strongly ordered accesses will be correctly ordered w.r.to all other >> accesses in program order. > > Wow, I hadn't noticed dcache_enable() also enables the MMU. Cool. One step > closer to splitting the mapping between a cached and uncached, strongly-ordered > region of RAM. Yes. MMU is enabled with an identity mapping. Please note that ARMv6 onwards D-cache can not be enabled when MMU is disabled. I-cache works without MMU enabled. > >> 4. Compiler barriers prevent the compiler from doing certain >> optimizations, but doesn't help in these situations. > > Thanks for this excellent summary! > >> In short, I think you don't need to worry about ordering, but not >> because of compiler barriers or armv7 being in-order. > > Well, I do need to worry about the buffer-descriptors and buffers then, since > they are in SDRAM, right? Yes, you need to worry about buffers and descriptors. Here are the rules-of-thumb to take care of them: 1. Flush before initiating memory to peripheral DMA(this applies to descriptors too) 2. Invalidate after peripheral to memory DMA (before the CPU reads the buffer) 3. Make sure that any buffer that you have to invalidate is cache-line aligned at the beginning and at the end. br, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 11:07 ` Matthias Weißer 2011-07-19 11:17 ` David Jander @ 2011-07-19 11:19 ` Wolfgang Denk 2011-07-19 14:31 ` Matthias Weißer 2011-07-19 11:51 ` Aneesh V 2 siblings, 1 reply; 36+ messages in thread From: Wolfgang Denk @ 2011-07-19 11:19 UTC (permalink / raw) To: u-boot Dear =?ISO-8859-1?Q?Matthias_Wei=DFer?=, In message <4E256588.4010301@arcor.de> you wrote: > > Is this really a good idea? This will break a couple of boards using > non-cache-aware drivers. And there are a couple of them in u-boot. I > think d-cache should be opt-in rather then opt-out as long as there are > any drivers which didn't handle cached memory regions correct. i-cache > is much less problematic and can be enabled by default. If we do this, then everybody will just be lazy, and nothing will ever change. The lesson we learned (a number of times) is that people only start moving when it really hurts, i. e. when things really break for them. I don't like that, but if it's the only way to make any progress then so be it. > If d-cache will be enabled by default on ARM I think I have to send a > patch for one of my boards :-) Why don't you just help identifying and fixing the problems in the misbehaving drivers?!? That would be a _much_ more helpful approach. Thanks. Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de "I've finally learned what `upward compatible' means. It means we get to keep all our old mistakes." - Dennie van Tassel ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 11:19 ` Wolfgang Denk @ 2011-07-19 14:31 ` Matthias Weißer 0 siblings, 0 replies; 36+ messages in thread From: Matthias Weißer @ 2011-07-19 14:31 UTC (permalink / raw) To: u-boot Am 19.07.2011 13:19, schrieb Wolfgang Denk: > Dear =?ISO-8859-1?Q?Matthias_Wei=DFer?=, > > In message<4E256588.4010301@arcor.de> you wrote: >> >> Is this really a good idea? This will break a couple of boards using >> non-cache-aware drivers. And there are a couple of them in u-boot. I >> think d-cache should be opt-in rather then opt-out as long as there are >> any drivers which didn't handle cached memory regions correct. i-cache >> is much less problematic and can be enabled by default. > > If we do this, then everybody will just be lazy, and nothing will ever > change. You are right. But time is a limited resource. And fixing cache related issues is a bit more complex then, for example, the simple changes needed when relocation was introduced in ARM. >> If d-cache will be enabled by default on ARM I think I have to send a >> patch for one of my boards :-) > > Why don't you just help identifying and fixing the problems in the > misbehaving drivers?!? > > That would be a _much_ more helpful approach. Well, that is what I meant. I have to fix the driver then. But this will interfere with the fact that time is a lim.... You know ;-) If both of my boards are in mainline I will take a look at them and try to fix problematic drivers (mainly USB OHCI should be the problem) Regards Matthias ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 11:07 ` Matthias Weißer 2011-07-19 11:17 ` David Jander 2011-07-19 11:19 ` Wolfgang Denk @ 2011-07-19 11:51 ` Aneesh V 2 siblings, 0 replies; 36+ messages in thread From: Aneesh V @ 2011-07-19 11:51 UTC (permalink / raw) To: u-boot Dear Matthias, On Tuesday 19 July 2011 04:37 PM, Matthias Wei?er wrote: > Dear Aneesh > > Am 18.07.2011 18:16, schrieb Aneesh V: >> commit c2dd0d45540397704de9b13287417d21049d34c6 >> armv7: integrate cache maintenance support >> >> In this patch I added a call to dcache_enable() at the beginning of >> board_init_r() for ARM(i.e. as soon as relocation is over). As a result >> D-cache will be enabled for all ARM platforms now unless >> CONFIG_SYS_DCACHE_OFF is set. > > Is this really a good idea? This will break a couple of boards using > non-cache-aware drivers. And there are a couple of them in u-boot. I > think d-cache should be opt-in rather then opt-out as long as there are > any drivers which didn't handle cached memory regions correct. i-cache > is much less problematic and can be enabled by default. D-cache was always an opt-out in u-boot based on the fact that the flag was CONFIG_SYS_DCACHE_OFF and not something like CONFIG_SYS_DCACHE_ENABLE. I just made the behavior more consistent and predictable. Also, my idea was to enable it(if allowed by the config flag) at the earliest possible time rather than waiting for a random driver or board file to enable it much later. Regarding the question of whether it should be opt-out or opt-in I have no preference myself. best regards, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-18 15:18 [U-Boot] i.MX51: FEC: Cache coherency problem? David Jander 2011-07-18 16:16 ` Aneesh V @ 2011-07-18 16:55 ` Stefano Babic 2011-07-19 7:44 ` David Jander 1 sibling, 1 reply; 36+ messages in thread From: Stefano Babic @ 2011-07-18 16:55 UTC (permalink / raw) To: u-boot On 07/18/2011 05:18 PM, David Jander wrote: > > Hi all, Hi David, > What is going on here? Why did this work with caches enabled before?? I think cache was always disabled.. Best regards, Stefano Babic -- ===================================================================== DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: +49-8142-66989-0 Fax: +49-8142-66989-80 Email: office at denx.de ===================================================================== ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-18 16:55 ` Stefano Babic @ 2011-07-19 7:44 ` David Jander 2011-07-19 8:21 ` Albert ARIBAUD 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-19 7:44 UTC (permalink / raw) To: u-boot Hi Stefano, On Mon, 18 Jul 2011 18:55:05 +0200 Stefano Babic <sbabic@denx.de> wrote: > On 07/18/2011 05:18 PM, David Jander wrote: > > > > Hi all, > > Hi David, > > > What is going on here? Why did this work with caches enabled before?? > > I think cache was always disabled.. I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP cache.S), and called i/dcache_enable() from board code like this: int board_late_init(void) { power_init(); probe_board_type(); icache_enable(); dcache_enable(); return 0; } Is there a reason this wouldn't have worked before? Suppose it didn't. Does that mean we need to use the MMU to properly mark regions of register space and specially FEC BD's as not-cached? Or do we need to flash caches manually each time such a memory region is accessed? I am kind of a CPU-speed-junkie, so I am not sure I want to live without caches enabled in u-boot ;-) Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 7:44 ` David Jander @ 2011-07-19 8:21 ` Albert ARIBAUD 2011-07-19 8:37 ` David Jander 0 siblings, 1 reply; 36+ messages in thread From: Albert ARIBAUD @ 2011-07-19 8:21 UTC (permalink / raw) To: u-boot Hi David, Le 19/07/2011 09:44, David Jander a ?crit : > > Hi Stefano, > > On Mon, 18 Jul 2011 18:55:05 +0200 > Stefano Babic<sbabic@denx.de> wrote: > >> On 07/18/2011 05:18 PM, David Jander wrote: >>> >>> Hi all, >> >> Hi David, >> >>> What is going on here? Why did this work with caches enabled before?? >> >> I think cache was always disabled.. > > I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP > cache.S), and called i/dcache_enable() from board code like this: > > int board_late_init(void) > { > power_init(); > probe_board_type(); > icache_enable(); > dcache_enable(); > > return 0; > } > > Is there a reason this wouldn't have worked before? > > Suppose it didn't. Does that mean we need to use the MMU to properly mark > regions of register space and specially FEC BD's as not-cached? Or do we need > to flash caches manually each time such a memory region is accessed? > I am kind of a CPU-speed-junkie, so I am not sure I want to live without > caches enabled in u-boot ;-) You would have to flush (before sending packets / starting external memory-to-device DMA) and invalidate (before reading received packets / after external device-to-memory DMA is done); using MMU and mapping cached/non-cached areas is IMO overkill, and will hurt CPU accesses to the xmit/receive buffers and descriptors. > Best regards, Amicalement, -- Albert. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 8:21 ` Albert ARIBAUD @ 2011-07-19 8:37 ` David Jander 2011-07-19 8:43 ` Aneesh V 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-19 8:37 UTC (permalink / raw) To: u-boot On Tue, 19 Jul 2011 10:21:12 +0200 Albert ARIBAUD <albert.u.boot@aribaud.net> wrote: > Hi David, > > Le 19/07/2011 09:44, David Jander a ?crit : > > > > Hi Stefano, > > > > On Mon, 18 Jul 2011 18:55:05 +0200 > > Stefano Babic<sbabic@denx.de> wrote: > > > >> On 07/18/2011 05:18 PM, David Jander wrote: > >>> > >>> Hi all, > >> > >> Hi David, > >> > >>> What is going on here? Why did this work with caches enabled before?? > >> > >> I think cache was always disabled.. > > > > I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP > > cache.S), and called i/dcache_enable() from board code like this: > > > > int board_late_init(void) > > { > > power_init(); > > probe_board_type(); > > icache_enable(); > > dcache_enable(); > > > > return 0; > > } > > > > Is there a reason this wouldn't have worked before? > > > > Suppose it didn't. Does that mean we need to use the MMU to properly mark > > regions of register space and specially FEC BD's as not-cached? Or do we > > need to flash caches manually each time such a memory region is accessed? > > I am kind of a CPU-speed-junkie, so I am not sure I want to live without > > caches enabled in u-boot ;-) > > You would have to flush (before sending packets / starting external > memory-to-device DMA) and invalidate (before reading received packets / > after external device-to-memory DMA is done); using MMU and mapping > cached/non-cached areas is IMO overkill, and will hurt CPU accesses to > the xmit/receive buffers and descriptors. So, you say actually what I did while exploring the problem would have been a correct way of solving this problem? Like this: 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); 588 fec_tx_task_enable(fec); 589 flush_dcache_all(); 590 591 /* 592 * wait until frame is sent . 593 */ 594 while (readw(&fec->tbd_base[fec->tbd_index].status) & FEC_TBD_READY) { 595 udelay(1); 596 } I am still not sure why I need line 587 above. Would a patch to fec_mxc.c that does the necessary cache handwork be acceptable for u-boot? Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 8:37 ` David Jander @ 2011-07-19 8:43 ` Aneesh V 2011-07-19 8:58 ` David Jander 2011-07-19 9:05 ` Albert ARIBAUD 0 siblings, 2 replies; 36+ messages in thread From: Aneesh V @ 2011-07-19 8:43 UTC (permalink / raw) To: u-boot On Tuesday 19 July 2011 02:07 PM, David Jander wrote: > On Tue, 19 Jul 2011 10:21:12 +0200 > Albert ARIBAUD<albert.u.boot@aribaud.net> wrote: > >> Hi David, >> >> Le 19/07/2011 09:44, David Jander a ?crit : >>> >>> Hi Stefano, >>> >>> On Mon, 18 Jul 2011 18:55:05 +0200 >>> Stefano Babic<sbabic@denx.de> wrote: >>> >>>> On 07/18/2011 05:18 PM, David Jander wrote: >>>>> >>>>> Hi all, >>>> >>>> Hi David, >>>> >>>>> What is going on here? Why did this work with caches enabled before?? >>>> >>>> I think cache was always disabled.. >>> >>> I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP >>> cache.S), and called i/dcache_enable() from board code like this: >>> >>> int board_late_init(void) >>> { >>> power_init(); >>> probe_board_type(); >>> icache_enable(); >>> dcache_enable(); >>> >>> return 0; >>> } >>> >>> Is there a reason this wouldn't have worked before? >>> >>> Suppose it didn't. Does that mean we need to use the MMU to properly mark >>> regions of register space and specially FEC BD's as not-cached? Or do we >>> need to flash caches manually each time such a memory region is accessed? >>> I am kind of a CPU-speed-junkie, so I am not sure I want to live without >>> caches enabled in u-boot ;-) Are you talking about some memory-mapped IO region(register space). If that is the case, that region won't be cached. ARM mmu implementation makes only the SDRAM region cached. Rest is non-cached non-buffered. >> >> You would have to flush (before sending packets / starting external >> memory-to-device DMA) and invalidate (before reading received packets / >> after external device-to-memory DMA is done); using MMU and mapping >> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >> the xmit/receive buffers and descriptors. > > So, you say actually what I did while exploring the problem would have been a > correct way of solving this problem? > > Like this: > > 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); This is what is needed assuming the below is initiating a memory to peripheral DMA. Is your buffer only 4 bytes long? Also, please check if flush_cache() is correctly supported for your CPU. The default implementation in in arch/arm/lib/cache.c has support for only a handful of cpus. AFAIK, only armv7 is over-riding this default implementation at the moment. The fact that it's helping you indicates that it may be working for you. But still worth a check. > 588 fec_tx_task_enable(fec); > 589 flush_dcache_all(); This should not be needed. > 590 > 591 /* > 592 * wait until frame is sent . > 593 */ > 594 while (readw(&fec->tbd_base[fec->tbd_index].status)& FEC_TBD_READY) { > 595 udelay(1); > 596 } > > I am still not sure why I need line 587 above. Did you try keeping 587 and removing 589? best regards, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 8:43 ` Aneesh V @ 2011-07-19 8:58 ` David Jander 2011-07-19 9:11 ` Albert ARIBAUD 2011-07-19 11:42 ` Aneesh V 2011-07-19 9:05 ` Albert ARIBAUD 1 sibling, 2 replies; 36+ messages in thread From: David Jander @ 2011-07-19 8:58 UTC (permalink / raw) To: u-boot Dear Aneesh, Thanks a lot for your replies. On Tue, 19 Jul 2011 14:13:34 +0530 Aneesh V <aneesh@ti.com> wrote: > On Tuesday 19 July 2011 02:07 PM, David Jander wrote: > > On Tue, 19 Jul 2011 10:21:12 +0200 > > Albert ARIBAUD<albert.u.boot@aribaud.net> wrote: > > > >> Hi David, > >> > >> Le 19/07/2011 09:44, David Jander a ?crit : > >>> > >>> Hi Stefano, > >>> > >>> On Mon, 18 Jul 2011 18:55:05 +0200 > >>> Stefano Babic<sbabic@denx.de> wrote: > >>> > >>>> On 07/18/2011 05:18 PM, David Jander wrote: > >>>>> > >>>>> Hi all, > >>>> > >>>> Hi David, > >>>> > >>>>> What is going on here? Why did this work with caches enabled before?? > >>>> > >>>> I think cache was always disabled.. > >>> > >>> I had even L2-caches enabled in u-boot (copied/adapted some code from > >>> OMAP cache.S), and called i/dcache_enable() from board code like this: > >>> > >>> int board_late_init(void) > >>> { > >>> power_init(); > >>> probe_board_type(); > >>> icache_enable(); > >>> dcache_enable(); > >>> > >>> return 0; > >>> } > >>> > >>> Is there a reason this wouldn't have worked before? > >>> > >>> Suppose it didn't. Does that mean we need to use the MMU to properly mark > >>> regions of register space and specially FEC BD's as not-cached? Or do we > >>> need to flash caches manually each time such a memory region is accessed? > >>> I am kind of a CPU-speed-junkie, so I am not sure I want to live without > >>> caches enabled in u-boot ;-) > > Are you talking about some memory-mapped IO region(register space). If > that is the case, that region won't be cached. ARM mmu implementation > makes only the SDRAM region cached. Rest is non-cached non-buffered. Ah, thanks for pointing out. I already suspected something like that... > >> You would have to flush (before sending packets / starting external > >> memory-to-device DMA) and invalidate (before reading received packets / > >> after external device-to-memory DMA is done); using MMU and mapping > >> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to > >> the xmit/receive buffers and descriptors. > > > > So, you say actually what I did while exploring the problem would have > > been a correct way of solving this problem? > > > > Like this: > > > > 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); > > This is what is needed assuming the below is initiating a memory to > peripheral DMA. Is your buffer only 4 bytes long? No it isn't. I know, I should flush the whole buffer area, but this was just enough to get the status field flushed, so the FEC started transmitting, and the while loop ended eventually. The result was still not correct, but at least it won't hang. What would be more expensive, flushing just the buffer area, or flush_dcache_all()? > Also, please check if flush_cache() is correctly supported for your > CPU. The default implementation in in arch/arm/lib/cache.c has support > for only a handful of cpus. AFAIK, only armv7 is over-riding this > default implementation at the moment. There is cache_v7.c which implements these... they are supposed to work correctly I guess? > The fact that it's helping you indicates that it may be working for > you. But still worth a check. > > > 588 fec_tx_task_enable(fec); > > 589 flush_dcache_all(); > > This should not be needed. I agree, but without it the while loop below still gets stuck. > > 590 > > 591 /* > > 592 * wait until frame is sent . > > 593 */ > > 594 while (readw(&fec->tbd_base[fec->tbd_index].status)& > > FEC_TBD_READY) { > > 595 udelay(1); > > 596 } > > > > I am still not sure why I need line 587 above. > > Did you try keeping 587 and removing 589? Yes, I did. These two lines really is the minimum necessary to exit the while loop. It won't work if I leave out either of those. Now that I know a little more, I guess this is because of the status field in the buffer-descriptor being checked in the while loop, and that is still in cache. So the only thing line 589 does, is invalidate the caches, so the next readw() returns the value stored by the FEC, which is apparently faster than this piece of code :-) Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 8:58 ` David Jander @ 2011-07-19 9:11 ` Albert ARIBAUD 2011-07-19 11:50 ` Aneesh V 2011-07-19 11:42 ` Aneesh V 1 sibling, 1 reply; 36+ messages in thread From: Albert ARIBAUD @ 2011-07-19 9:11 UTC (permalink / raw) To: u-boot Hi David, Le 19/07/2011 10:58, David Jander a ?crit : >>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >> >> This is what is needed assuming the below is initiating a memory to >> peripheral DMA. Is your buffer only 4 bytes long? > > No it isn't. I know, I should flush the whole buffer area, but this was just > enough to get the status field flushed, so the FEC started transmitting, and > the while loop ended eventually. The result was still not correct, but at > least it won't hang. Seems like you're flushing while the DMA engine is running... That's calling for race conditions IMO. You should only have the DMA engine running while you are sending something or while you are expecting to receive something (U-Boot differs from an OS in that devices are not kept running unless actually needed). > What would be more expensive, flushing just the buffer area, or > flush_dcache_all()? You should call the API for flushing the area only. Obviously, you'll flush either just as much, or possibly more (thus take longer) if you flush the whole data cache. > Yes, I did. These two lines really is the minimum necessary to exit the while > loop. It won't work if I leave out either of those. > Now that I know a little more, I guess this is because of the status field in > the buffer-descriptor being checked in the while loop, and that is still in > cache. So the only thing line 589 does, is invalidate the caches, so the next > readw() returns the value stored by the FEC, which is apparently faster than > this piece of code :-) One more reason not to let it run its DMA until you're ready. :) > Best regards, Amicalement, -- Albert. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 9:11 ` Albert ARIBAUD @ 2011-07-19 11:50 ` Aneesh V 0 siblings, 0 replies; 36+ messages in thread From: Aneesh V @ 2011-07-19 11:50 UTC (permalink / raw) To: u-boot Hi Albert, David, On Tuesday 19 July 2011 02:41 PM, Albert ARIBAUD wrote: > Hi David, > > Le 19/07/2011 10:58, David Jander a ?crit : > >>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >>> >>> This is what is needed assuming the below is initiating a memory to >>> peripheral DMA. Is your buffer only 4 bytes long? >> >> No it isn't. I know, I should flush the whole buffer area, but this >> was just >> enough to get the status field flushed, so the FEC started >> transmitting, and >> the while loop ended eventually. The result was still not correct, but at >> least it won't hang. > > Seems like you're flushing while the DMA engine is running... That's > calling for race conditions IMO. You should only have the DMA engine > running while you are sending something or while you are expecting to > receive something (U-Boot differs from an OS in that devices are not > kept running unless actually needed). > >> What would be more expensive, flushing just the buffer area, or >> flush_dcache_all()? > > You should call the API for flushing the area only. Obviously, you'll > flush either just as much, or possibly more (thus take longer) if you > flush the whole data cache. Flushing an entire cache line by line is typically not as efficient as using the special instructions to flush the entire cache(particularly for large L2 caches). In our experiments on the 1MB L2$ cache on Cortex-A9 the latter was more than 2x fast compared to the former. best regards, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 8:58 ` David Jander 2011-07-19 9:11 ` Albert ARIBAUD @ 2011-07-19 11:42 ` Aneesh V 1 sibling, 0 replies; 36+ messages in thread From: Aneesh V @ 2011-07-19 11:42 UTC (permalink / raw) To: u-boot On Tuesday 19 July 2011 02:28 PM, David Jander wrote: > > Dear Aneesh, > > Thanks a lot for your replies. > > On Tue, 19 Jul 2011 14:13:34 +0530 > Aneesh V<aneesh@ti.com> wrote: >> On Tuesday 19 July 2011 02:07 PM, David Jander wrote: >>> On Tue, 19 Jul 2011 10:21:12 +0200 >>> Albert ARIBAUD<albert.u.boot@aribaud.net> wrote: >>> >>>> Hi David, >>>> >>>> Le 19/07/2011 09:44, David Jander a ?crit : >>>>> >>>>> Hi Stefano, >>>>> >>>>> On Mon, 18 Jul 2011 18:55:05 +0200 >>>>> Stefano Babic<sbabic@denx.de> wrote: >>>>> >>>>>> On 07/18/2011 05:18 PM, David Jander wrote: >>>>>>> >>>>>>> Hi all, >>>>>> >>>>>> Hi David, >>>>>> >>>>>>> What is going on here? Why did this work with caches enabled before?? >>>>>> >>>>>> I think cache was always disabled.. >>>>> >>>>> I had even L2-caches enabled in u-boot (copied/adapted some code from >>>>> OMAP cache.S), and called i/dcache_enable() from board code like this: >>>>> >>>>> int board_late_init(void) >>>>> { >>>>> power_init(); >>>>> probe_board_type(); >>>>> icache_enable(); >>>>> dcache_enable(); >>>>> >>>>> return 0; >>>>> } >>>>> >>>>> Is there a reason this wouldn't have worked before? >>>>> >>>>> Suppose it didn't. Does that mean we need to use the MMU to properly mark >>>>> regions of register space and specially FEC BD's as not-cached? Or do we >>>>> need to flash caches manually each time such a memory region is accessed? >>>>> I am kind of a CPU-speed-junkie, so I am not sure I want to live without >>>>> caches enabled in u-boot ;-) >> >> Are you talking about some memory-mapped IO region(register space). If >> that is the case, that region won't be cached. ARM mmu implementation >> makes only the SDRAM region cached. Rest is non-cached non-buffered. > > Ah, thanks for pointing out. I already suspected something like that... > >>>> You would have to flush (before sending packets / starting external >>>> memory-to-device DMA) and invalidate (before reading received packets / >>>> after external device-to-memory DMA is done); using MMU and mapping >>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >>>> the xmit/receive buffers and descriptors. >>> >>> So, you say actually what I did while exploring the problem would have >>> been a correct way of solving this problem? >>> >>> Like this: >>> >>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >> >> This is what is needed assuming the below is initiating a memory to >> peripheral DMA. Is your buffer only 4 bytes long? > > No it isn't. I know, I should flush the whole buffer area, but this was just > enough to get the status field flushed, so the FEC started transmitting, and > the while loop ended eventually. The result was still not correct, but at > least it won't hang. > What would be more expensive, flushing just the buffer area, or > flush_dcache_all()? That depends on the buffer size. If the buffer size is really big (comparable to L2$ size) flush_dcache_all() may turn out to be better. > >> Also, please check if flush_cache() is correctly supported for your >> CPU. The default implementation in in arch/arm/lib/cache.c has support >> for only a handful of cpus. AFAIK, only armv7 is over-riding this >> default implementation at the moment. > > There is cache_v7.c which implements these... they are supposed to work > correctly I guess? Oops! I didn't notice that your CPU is armv7. Yes, it's supposed to work correctly. best regards, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 8:43 ` Aneesh V 2011-07-19 8:58 ` David Jander @ 2011-07-19 9:05 ` Albert ARIBAUD 2011-07-19 14:36 ` J. William Campbell 1 sibling, 1 reply; 36+ messages in thread From: Albert ARIBAUD @ 2011-07-19 9:05 UTC (permalink / raw) To: u-boot Le 19/07/2011 10:43, Aneesh V a ?crit : >>> You would have to flush (before sending packets / starting external >>> memory-to-device DMA) and invalidate (before reading received packets / >>> after external device-to-memory DMA is done); using MMU and mapping >>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >>> the xmit/receive buffers and descriptors. >> >> So, you say actually what I did while exploring the problem would have >> been a >> correct way of solving this problem? >> >> Like this: >> >> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); > > This is what is needed assuming the below is initiating a memory to > peripheral DMA. Is your buffer only 4 bytes long? Generally: - for sending data through a device that has its own, external, DMA engine, you'll obviously need to flush the data buffer(s) but also any DMA descriptors used by the engine, before you start the engine; - for rceiving, if you have to set up receive descriptors, you must flush that before telling the device to enter receive mode (so that the device reads the descriptors as you wrote them), and you should invalidate the receive buffers at the latest when the device signals that data has been received, or preferably long before (at the same time you flushed the read descriptor, so that cache-related actions are grouped in the same place in the code). Amicalement, -- Albert. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 9:05 ` Albert ARIBAUD @ 2011-07-19 14:36 ` J. William Campbell 2011-07-19 15:17 ` David Jander 2011-07-19 18:14 ` Anton Staaf 0 siblings, 2 replies; 36+ messages in thread From: J. William Campbell @ 2011-07-19 14:36 UTC (permalink / raw) To: u-boot On 7/19/2011 2:05 AM, Albert ARIBAUD wrote: > Le 19/07/2011 10:43, Aneesh V a ?crit : > >>>> You would have to flush (before sending packets / starting external >>>> memory-to-device DMA) and invalidate (before reading received packets / >>>> after external device-to-memory DMA is done); using MMU and mapping >>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >>>> the xmit/receive buffers and descriptors. >>> So, you say actually what I did while exploring the problem would have >>> been a >>> correct way of solving this problem? >>> >>> Like this: >>> >>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >> This is what is needed assuming the below is initiating a memory to >> peripheral DMA. Is your buffer only 4 bytes long? > Generally: > > - for sending data through a device that has its own, external, DMA > engine, you'll obviously need to flush the data buffer(s) but also any > DMA descriptors used by the engine, before you start the engine; > > - for rceiving, if you have to set up receive descriptors, you must > flush that before telling the device to enter receive mode (so that the > device reads the descriptors as you wrote them), and you should > invalidate the receive buffers at the latest when the device signals > that data has been received, Hi All, > or preferably long before (at the same time > you flushed the read descriptor, so that cache-related actions are > grouped in the same place in the code). I think this last statement is incorrect. You should invalidate the cache for the receive buffers just before you intend to reference them. If you do it right after receive mode is entered, subsequent access to items NEAR the receive buffer may reload the cache with receive buffer data before the dma is done, re-creating the problem you are trying to avoid. Also, I don't know if ARM cache is write-back or write -thru, but if it is write-back, the only way to avoid problems is to allocate the receive buffers on cache line boundaries, so no "nearby" writes can cause something in the DMA buffer to be corrupted. If all receive buffers are allocated on cache-line boundaries (both start and end of each buffer), you can invalidate the cache "early" under the assumption that there will be no read accesses to the read buffers themselves until after DMA is complete. IMHO it is better, even in this case., to invalidate cache after dma is done but before referencing the read data. Best Regards, Bill Campbell > > Amicalement, ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 14:36 ` J. William Campbell @ 2011-07-19 15:17 ` David Jander 2011-07-19 18:14 ` Anton Staaf 1 sibling, 0 replies; 36+ messages in thread From: David Jander @ 2011-07-19 15:17 UTC (permalink / raw) To: u-boot On Tue, 19 Jul 2011 07:36:32 -0700 "J. William Campbell" <jwilliamcampbell@comcast.net> wrote: > On 7/19/2011 2:05 AM, Albert ARIBAUD wrote: > > Le 19/07/2011 10:43, Aneesh V a ?crit : > > > >>>> You would have to flush (before sending packets / starting external > >>>> memory-to-device DMA) and invalidate (before reading received packets / > >>>> after external device-to-memory DMA is done); using MMU and mapping > >>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to > >>>> the xmit/receive buffers and descriptors. > >>> So, you say actually what I did while exploring the problem would have > >>> been a > >>> correct way of solving this problem? > >>> > >>> Like this: > >>> > >>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); > >> This is what is needed assuming the below is initiating a memory to > >> peripheral DMA. Is your buffer only 4 bytes long? > > Generally: > > > > - for sending data through a device that has its own, external, DMA > > engine, you'll obviously need to flush the data buffer(s) but also any > > DMA descriptors used by the engine, before you start the engine; > > > > - for rceiving, if you have to set up receive descriptors, you must > > flush that before telling the device to enter receive mode (so that the > > device reads the descriptors as you wrote them), and you should > > invalidate the receive buffers at the latest when the device signals > > that data has been received, > Hi All, > > > or preferably long before (at the same time > > you flushed the read descriptor, so that cache-related actions are > > grouped in the same place in the code). > I think this last statement is incorrect. You should invalidate the > cache for the receive buffers just before you intend to reference them. > If you do it right after receive mode is entered, subsequent access to > items NEAR the receive buffer may reload the cache with receive buffer > data before the dma is done, re-creating the problem you are trying to > avoid. Also, I don't know if ARM cache is write-back or write -thru, but > if it is write-back, the only way to avoid problems is to allocate the > receive buffers on cache line boundaries, so no "nearby" writes can > cause something in the DMA buffer to be corrupted. If all receive > buffers are allocated on cache-line boundaries (both start and end of > each buffer), you can invalidate the cache "early" under the assumption > that there will be no read accesses to the read buffers themselves until > after DMA is complete. IMHO it is better, even in this case., to > invalidate cache after dma is done but before referencing the read data. Hey, thanks a lot for this bit of information! I really needed it! :-) Indeed I hadn't thought about the situation of nearby memory accesses allocating critical cache-lines... Coming to think of it... would it really be that hard to enable the MMU in u-boot and configure just two pools of flat, direct mapped linear memory to malloc and dma_malloc from? That would make fixing broken drivers trivial, and have other benefits (like video memory coherence when caches are on)! IMHO, changing the necessary malloc()'s for dma_malloc()'s is a lot less work and easier to oversee than obscure cache-manipulation and alignment tricks all over the place.... What do you think? Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 14:36 ` J. William Campbell 2011-07-19 15:17 ` David Jander @ 2011-07-19 18:14 ` Anton Staaf 2011-07-19 20:11 ` J. William Campbell 2011-07-20 8:37 ` Aneesh V 1 sibling, 2 replies; 36+ messages in thread From: Anton Staaf @ 2011-07-19 18:14 UTC (permalink / raw) To: u-boot On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell <jwilliamcampbell@comcast.net> wrote: > On 7/19/2011 2:05 AM, Albert ARIBAUD wrote: >> Le 19/07/2011 10:43, Aneesh V a ?crit : >> >>>>> You would have to flush (before sending packets / starting external >>>>> memory-to-device DMA) and invalidate (before reading received packets / >>>>> after external device-to-memory DMA is done); using MMU and mapping >>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >>>>> the xmit/receive buffers and descriptors. >>>> So, you say actually what I did while exploring the problem would have >>>> been a >>>> correct way of solving this problem? >>>> >>>> Like this: >>>> >>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >>> This is what is needed assuming the below is initiating a memory to >>> peripheral DMA. Is your buffer only 4 bytes long? >> Generally: >> >> - for sending data through a device that has its own, external, DMA >> engine, you'll obviously need to flush the data buffer(s) but also any >> DMA descriptors used by the engine, before you start the engine; >> >> - for rceiving, if you have to set up receive descriptors, you must >> flush that before telling the device to enter receive mode (so that the >> device reads the descriptors as you wrote them), and you should >> invalidate the receive buffers at the latest when the device signals >> that data has been received, > Hi All, > >> ? or preferably long before (at the same time >> you flushed the read descriptor, so that cache-related actions are >> grouped in the same place in the code). > I think this last statement is incorrect. You should invalidate the > cache for the receive buffers just before you intend to reference them. > If you do it right after receive mode is entered, subsequent access to > items NEAR the receive buffer may reload the cache with receive buffer > data before the dma is done, re-creating the problem you are trying to > avoid. Also, I don't know if ARM cache is write-back or write -thru, but > if it is write-back, the only way to avoid problems is to allocate the > receive buffers on cache line boundaries, so no "nearby" writes can > cause something in the DMA buffer to be corrupted. If all receive > buffers are allocated on cache-line boundaries (both start and end of > each buffer), you can invalidate the cache "early" under the assumption > that there will be no read accesses to the read buffers themselves until > after DMA is complete. IMHO it is better, even in this case., to > invalidate cache after dma is done but before referencing the read data. This is a critical observation, and one I was going to make if I had made it to the end of the thread and no one had already pointed it out. In fact, there is no way with the current implementation (that I can see) of the v7_dcache_inval_range function to correctly implement a cache enabled DMA driven driver without aligning buffers to cache line sizes. Below is the commit message from a recent patch I made locally to fix the Tegra MMC driver. I wanted to start a discussion on the list about forcing all buffers to be aligned to cache line sizes. The problem is that many buffers are stack allocated or part of structs that are of unknown alignment. mmc: Tegra2: Enable dcache support by bouncing unaligned requests. Dcache support was disabled due to dcache/dma interactions with unaligned buffers. When an unaligned buffer is used for DMA the first and last few bytes of the buffer will be clobbered by the dcache invalidate call that is required to make the contents of the buffer visible to the CPU post DMA. The reason that these bytes are clobbered is that the dcache invalidate code (which is the same as the linux kernels) checks for unaligned invalidates and first flushes the cache lines that are being partially invalidated. This is required because the cache lines may be shared with other variables, and may be dirty. This flush however writes over the first and last few bytes of the unaligned buffer with whatever happened to be in the cache. There are a number of possible solutions: 1) Modify the invalidate code to first read the partial cache line and then invalidate and then write back just the valid part of the line. This suffers from a race condition with concurrent code in interrupt handlers or other CPUs. 2) Modify all of U-Boot to allocate buffers from a block allocation mechanism that ensures they are alligned on cache line boundaries. While this is possible, there are some cases where the public API passes down a buffer pointer all the way to the block read interface. So all stand alone U-Boot programs would have to be fixed as well. 3) Use a bounce buffer that is known to be alligned. Allocate the buffer in the Tegra MMC code and ensure it is large enough for the read/write as well as aligned correctly. This solution requires an extra memcpy of the data read or written and has a high water mark on memory consumption. It can be conditionally used only when the buffer passed in is unaligned. This patch implements the third solution. -Anton ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 18:14 ` Anton Staaf @ 2011-07-19 20:11 ` J. William Campbell 2011-07-20 13:02 ` Albert ARIBAUD 2011-07-20 8:37 ` Aneesh V 1 sibling, 1 reply; 36+ messages in thread From: J. William Campbell @ 2011-07-19 20:11 UTC (permalink / raw) To: u-boot On 7/19/2011 11:14 AM, Anton Staaf wrote: > On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell > <jwilliamcampbell@comcast.net> wrote: >> On 7/19/2011 2:05 AM, Albert ARIBAUD wrote: >>> Le 19/07/2011 10:43, Aneesh V a ?crit : >>> >>>>>> You would have to flush (before sending packets / starting external >>>>>> memory-to-device DMA) and invalidate (before reading received packets / >>>>>> after external device-to-memory DMA is done); using MMU and mapping >>>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >>>>>> the xmit/receive buffers and descriptors. >>>>> So, you say actually what I did while exploring the problem would have >>>>> been a >>>>> correct way of solving this problem? >>>>> >>>>> Like this: >>>>> >>>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >>>> This is what is needed assuming the below is initiating a memory to >>>> peripheral DMA. Is your buffer only 4 bytes long? >>> Generally: >>> >>> - for sending data through a device that has its own, external, DMA >>> engine, you'll obviously need to flush the data buffer(s) but also any >>> DMA descriptors used by the engine, before you start the engine; >>> >>> - for rceiving, if you have to set up receive descriptors, you must >>> flush that before telling the device to enter receive mode (so that the >>> device reads the descriptors as you wrote them), and you should >>> invalidate the receive buffers at the latest when the device signals >>> that data has been received, >> Hi All, >> >>> or preferably long before (at the same time >>> you flushed the read descriptor, so that cache-related actions are >>> grouped in the same place in the code). >> I think this last statement is incorrect. You should invalidate the >> cache for the receive buffers just before you intend to reference them. >> If you do it right after receive mode is entered, subsequent access to >> items NEAR the receive buffer may reload the cache with receive buffer >> data before the dma is done, re-creating the problem you are trying to >> avoid. Also, I don't know if ARM cache is write-back or write -thru, but >> if it is write-back, the only way to avoid problems is to allocate the >> receive buffers on cache line boundaries, so no "nearby" writes can >> cause something in the DMA buffer to be corrupted. If all receive >> buffers are allocated on cache-line boundaries (both start and end of >> each buffer), you can invalidate the cache "early" under the assumption >> that there will be no read accesses to the read buffers themselves until >> after DMA is complete. IMHO it is better, even in this case., to >> invalidate cache after dma is done but before referencing the read data. > This is a critical observation, and one I was going to make if I had > made it to the end of the thread and no one had already pointed it > out. In fact, there is no way with the current implementation (that I > can see) of the v7_dcache_inval_range function to correctly implement > a cache enabled DMA driven driver without aligning buffers to cache > line sizes. Below is the commit message from a recent patch I made > locally to fix the Tegra MMC driver. I wanted to start a discussion > on the list about forcing all buffers to be aligned to cache line > sizes. The problem is that many buffers are stack allocated or part > of structs that are of unknown alignment. > > mmc: Tegra2: Enable dcache support by bouncing unaligned requests. > > Dcache support was disabled due to dcache/dma interactions with > unaligned buffers. When an unaligned buffer is used for DMA the > first and last few bytes of the buffer will be clobbered by the > dcache invalidate call that is required to make the contents of > the buffer visible to the CPU post DMA. The reason that these > bytes are clobbered is that the dcache invalidate code (which is > the same as the linux kernels) checks for unaligned invalidates > and first flushes the cache lines that are being partially > invalidated. This is required because the cache lines may be > shared with other variables, and may be dirty. Hi Anton, > This flush however > writes over the first and last few bytes of the unaligned buffer > with whatever happened to be in the cache. If this is true, then it means that the cache is of type write-back (as opposed to write-thru). From a (very brief) look at the arm7 manuals, it appears that both types of cache may be present in the cpu. Do you know how this operates? If all that we have is write-back, you are correct. There is no safe way to do DMA to non-cache aligned buffers. This means both ends of the buffer must be cache aligned, so you may have to allocate it bigger than the transfer requires. For unfamiliar readers, write-back cache has only a single dirty bit for each cache line. When a dirty line is flushed, the cpu doesn't know which data changed, so it writes all locations back into memory, thus write-back. Write-thru cache updates cache on a write, and also schedules a write to memory immediately of the new data. That way, no writes are actually necessary to flush the cache. The downside is, many updates of the same memory location will produce a lot of main-memory writes that may not actually be required. Best Regards, Bill Campbell > > There are a number of possible solutions: > > 1) Modify the invalidate code to first read the partial cache line > and then invalidate and then write back just the valid part of the > line. This suffers from a race condition with concurrent code in > interrupt handlers or other CPUs. > > 2) Modify all of U-Boot to allocate buffers from a block allocation > mechanism that ensures they are alligned on cache line boundaries. > While this is possible, there are some cases where the public API > passes down a buffer pointer all the way to the block read interface. > So all stand alone U-Boot programs would have to be fixed as well. > > 3) Use a bounce buffer that is known to be alligned. Allocate the > buffer in the Tegra MMC code and ensure it is large enough for the > read/write as well as aligned correctly. This solution requires an > extra memcpy of the data read or written and has a high water mark > on memory consumption. It can be conditionally used only when the > buffer passed in is unaligned. > > This patch implements the third solution. > > -Anton > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 20:11 ` J. William Campbell @ 2011-07-20 13:02 ` Albert ARIBAUD [not found] ` <4E26DF9D.5070709@comcast.net> 0 siblings, 1 reply; 36+ messages in thread From: Albert ARIBAUD @ 2011-07-20 13:02 UTC (permalink / raw) To: u-boot Le 19/07/2011 22:11, J. William Campbell a ?crit : > If this is true, then it means that the cache is of type write-back (as > opposed to write-thru). From a (very brief) look at the arm7 manuals, it > appears that both types of cache may be present in the cpu. Do you know > how this operates? Usually, copyback (rather than writeback) and writethough are modes of operation, not cache types. Amicalement, -- Albert. ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <4E26DF9D.5070709@comcast.net>]
[parent not found: <4E26E7AA.9070001@aribaud.net>]
* [U-Boot] i.MX51: FEC: Cache coherency problem? [not found] ` <4E26E7AA.9070001@aribaud.net> @ 2011-07-20 15:36 ` J. William Campbell 2011-07-21 6:48 ` David Jander 0 siblings, 1 reply; 36+ messages in thread From: J. William Campbell @ 2011-07-20 15:36 UTC (permalink / raw) To: u-boot On 7/20/2011 7:35 AM, Albert ARIBAUD wrote: > Le 20/07/2011 16:01, J. William Campbell a ?crit : >> On 7/20/2011 6:02 AM, Albert ARIBAUD wrote: >>> Le 19/07/2011 22:11, J. William Campbell a ?crit : >>> >>>> If this is true, then it means that the cache is of type write-back >>>> (as >>>> opposed to write-thru). From a (very brief) look at the arm7 >>>> manuals, it >>>> appears that both types of cache may be present in the cpu. Do you >>>> know >>>> how this operates? >>> Usually, copyback (rather than writeback) and writethough are modes of >>> operation, not cache types. >> Hi Albert, >> One some CPUs both cache modes are available. On many other CPUs (I >> would guess most), you have one fixed mode available, but not both. I >> have always seen the two modes described as write-back and >> write-through, but I am sure we are talking about the same things. > > We are. Copy-back is another name for write-back, not used by ARM but > by some others. > >> The >> examples that have both modes that I am familiar with have the mode as a >> "global" setting. It is not controlled by bits in the TLB or anything >> like that. How does it work on ARM? Is it fixed, globally, globally >> controlled, or controlled by memory management? > > Well, it's a bit complicated, because it depends on the architecture > version *and* implementation -- ARM themselves do not mandate things, > and it is up to the SoC designer to specify what cache they want and > what mode it supports, both at L1 and L2, in their specific instance > of ARM cores. And yes, you can have memory areas that are write-back > and others that are write-through in the same system. > >> If it is controlled by memory management, it looks to me like lots of >> problems could be avoided by operating with input type buffers set as >> write-through. One probably isn't going to be writing to input buffers >> much under program control anyway, so the performance loss should be >> minimal. This gets rid of the alignment restrictions on these buffers >> but not the invalidate/flush requirements. > > There's not much you can do about alignment issues except align to > cache line boundaries. > >> However, if memory management >> is required to set the cache mode, it might be best to operate with the >> buffers and descriptors un-cached. That gets rid of the flush/invalidate >> requirement at the expense of slowing down copying from read buffers. > > That makes 'best' a subjective choice, doesn't it? :) Hi All, Yes,it probably depends on the usage. > >> Probably a reasonable price to pay for the associated simplicity. > > Others would say that spending some time setting up alignments and > flushes and invalidates is a reasonable price to pay for increased > performance... That's an open debate where no solution is The Right > One(tm). > > For instance, consider the TFTP image reading. People would like the > image to end up in cached memory because we'll do some checksumming on > it before we give it control, and having it cached makes this step > quite faster; but we'll lose that if we put it in non-cached memory > because it comes through the Ethernet controller's DMA; and it would > be worse to receive packets in non-cached memory only to move their > contents into cached memory later on. > > I think properly aligning descriptors and buffers is enough to avoid > the mixed flush/invalidate line issue, and wisely putting instruction > barriers should be enough to get the added performance of cache > without too much of the hassle of memory management. I am pretty sure that all the drivers read the input data into intermediate buffers in all cases. There is no practical way to be sure the next packet received is the "right one" for the tftp. Plus there are headers involved, and furthermore there is no way to ensure that a tftp destination is located on a sector boundary. In short, you are going to copy from an input buffer to a destination. However, it is still correct that copying from an non-cached area is slower than from cached areas, because of burst reads vs. individual reads. However, I doubt that the u-boot user can tell the difference, as the network latency will far exceed the difference in copy time. The question is, which is easier to do, and that is probably a matter of opinion. However, it is safe to say that so far a cached solution has eluded us. That may be changing, but it would still be nice to know how to allocate a section of un-cached RAM in the ARM processor, in so far as the question has a single answer! That would allow easy portability of drivers that do not know about caches, of which there seems to be many. Best Regards, Bill Campbell > >> Best Regards, >> Bill Campbell > > Amicalement, ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-20 15:36 ` J. William Campbell @ 2011-07-21 6:48 ` David Jander 2011-07-23 13:04 ` Albert ARIBAUD 0 siblings, 1 reply; 36+ messages in thread From: David Jander @ 2011-07-21 6:48 UTC (permalink / raw) To: u-boot On Wed, 20 Jul 2011 08:36:12 -0700 "J. William Campbell" <jwilliamcampbell@comcast.net> wrote: > On 7/20/2011 7:35 AM, Albert ARIBAUD wrote: > > Le 20/07/2011 16:01, J. William Campbell a ?crit : > >> On 7/20/2011 6:02 AM, Albert ARIBAUD wrote: > >>> Le 19/07/2011 22:11, J. William Campbell a ?crit : > >>> > >>>> If this is true, then it means that the cache is of type write-back > >>>> (as > >>>> opposed to write-thru). From a (very brief) look at the arm7 > >>>> manuals, it > >>>> appears that both types of cache may be present in the cpu. Do you > >>>> know > >>>> how this operates? > >>> Usually, copyback (rather than writeback) and writethough are modes of > >>> operation, not cache types. > >> Hi Albert, > >> One some CPUs both cache modes are available. On many other CPUs (I > >> would guess most), you have one fixed mode available, but not both. I > >> have always seen the two modes described as write-back and > >> write-through, but I am sure we are talking about the same things. > > > > We are. Copy-back is another name for write-back, not used by ARM but > > by some others. > > > >> The > >> examples that have both modes that I am familiar with have the mode as a > >> "global" setting. It is not controlled by bits in the TLB or anything > >> like that. How does it work on ARM? Is it fixed, globally, globally > >> controlled, or controlled by memory management? > > > > Well, it's a bit complicated, because it depends on the architecture > > version *and* implementation -- ARM themselves do not mandate things, > > and it is up to the SoC designer to specify what cache they want and > > what mode it supports, both at L1 and L2, in their specific instance > > of ARM cores. And yes, you can have memory areas that are write-back > > and others that are write-through in the same system. > > > >> If it is controlled by memory management, it looks to me like lots of > >> problems could be avoided by operating with input type buffers set as > >> write-through. One probably isn't going to be writing to input buffers > >> much under program control anyway, so the performance loss should be > >> minimal. This gets rid of the alignment restrictions on these buffers > >> but not the invalidate/flush requirements. > > > > There's not much you can do about alignment issues except align to > > cache line boundaries. > > > >> However, if memory management > >> is required to set the cache mode, it might be best to operate with the > >> buffers and descriptors un-cached. That gets rid of the flush/invalidate > >> requirement at the expense of slowing down copying from read buffers. > > > > That makes 'best' a subjective choice, doesn't it? :) > Hi All, > Yes,it probably depends on the usage. > > > >> Probably a reasonable price to pay for the associated simplicity. > > > > Others would say that spending some time setting up alignments and > > flushes and invalidates is a reasonable price to pay for increased > > performance... That's an open debate where no solution is The Right > > One(tm). > > > > For instance, consider the TFTP image reading. People would like the > > image to end up in cached memory because we'll do some checksumming on > > it before we give it control, and having it cached makes this step > > quite faster; but we'll lose that if we put it in non-cached memory > > because it comes through the Ethernet controller's DMA; and it would > > be worse to receive packets in non-cached memory only to move their > > contents into cached memory later on. > > > > I think properly aligning descriptors and buffers is enough to avoid > > the mixed flush/invalidate line issue, and wisely putting instruction > > barriers should be enough to get the added performance of cache > > without too much of the hassle of memory management. > I am pretty sure that all the drivers read the input data into > intermediate buffers in all cases. There is no practical way to be sure > the next packet received is the "right one" for the tftp. Plus there are > headers involved, and furthermore there is no way to ensure that a tftp > destination is located on a sector boundary. In short, you are going to > copy from an input buffer to a destination. > However, it is still correct that copying from an non-cached area is > slower than from cached areas, because of burst reads vs. individual > reads. However, I doubt that the u-boot user can tell the difference, as > the network latency will far exceed the difference in copy time. The > question is, which is easier to do, and that is probably a matter of > opinion. However, it is safe to say that so far a cached solution has > eluded us. That may be changing, but it would still be nice to know how > to allocate a section of un-cached RAM in the ARM processor, in so far > as the question has a single answer! That would allow easy portability > of drivers that do not know about caches, of which there seems to be many. I agree. Unfortunately, my time is up for now, and I can't go on with trying to fix this driver. Maybe I'll pick up after my vacation. As for now I settled for the ugly solution of keeping dcache disabled while ethernet is being used :-( IMHO, doing cache maintenance all over the driver is not an easy or nice solution. Implementing a non-cached memory pool in the MMU and a corresponding dma_malloc() sounds like much more universally applicable to any driver. Best regards, -- David Jander Protonic Holland. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-21 6:48 ` David Jander @ 2011-07-23 13:04 ` Albert ARIBAUD 2011-07-23 15:35 ` J. William Campbell 0 siblings, 1 reply; 36+ messages in thread From: Albert ARIBAUD @ 2011-07-23 13:04 UTC (permalink / raw) To: u-boot Le 21/07/2011 08:48, David Jander a ?crit : >> However, it is still correct that copying from an non-cached area is >> slower than from cached areas, because of burst reads vs. individual >> reads. However, I doubt that the u-boot user can tell the difference, as >> the network latency will far exceed the difference in copy time. That's assuming cache is only for networking. There can be DMA engines in a lot of other peripherals which do not have the same latency as network (and then, even for networking, TFTP can be done from a very nearby server, possibly even on the same Ethernet segment). >> The >> question is, which is easier to do, and that is probably a matter of >> opinion. However, it is safe to say that so far a cached solution has >> eluded us. That may be changing, but it would still be nice to know how >> to allocate a section of un-cached RAM in the ARM processor, in so far >> as the question has a single answer! That would allow easy portability >> of drivers that do not know about caches, of which there seems to be many. That is one approach, which I think prevents cache from being used beyond caching pure CPU-used DRAM. > I agree. Unfortunately, my time is up for now, and I can't go on with trying > to fix this driver. Maybe I'll pick up after my vacation. > As for now I settled for the ugly solution of keeping dcache disabled while > ethernet is being used :-( Make sure you flush before disabling. :) > IMHO, doing cache maintenance all over the driver is not an easy or nice > solution. Implementing a non-cached memory pool in the MMU and a corresponding > dma_malloc() sounds like much more universally applicable to any driver. I think cache maintenance is feasible if one makes sure the cached areas used by the driver are properly aligned, which simplifies things a lot: you don't have to care for flush-invalidate or just-in-time invalidate, you just have to flush before sending and invalidate before reading. > Best regards, Amicalement, -- Albert. ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-23 13:04 ` Albert ARIBAUD @ 2011-07-23 15:35 ` J. William Campbell 0 siblings, 0 replies; 36+ messages in thread From: J. William Campbell @ 2011-07-23 15:35 UTC (permalink / raw) To: u-boot On 7/23/2011 6:04 AM, Albert ARIBAUD wrote: > Le 21/07/2011 08:48, David Jander a ?crit : > >>> However, it is still correct that copying from an non-cached area is >>> slower than from cached areas, because of burst reads vs. individual >>> reads. However, I doubt that the u-boot user can tell the >>> difference, as >>> the network latency will far exceed the difference in copy time. > > That's assuming cache is only for networking. There can be DMA engines > in a lot of other peripherals which do not have the same latency as > network (and then, even for networking, TFTP can be done from a very > nearby server, possibly even on the same Ethernet segment). Hi All, Yes, there are other uses of DMA. On a network, unless you have a Gigabit network, your memory access speed is at least an order of magnitude faster than the network, probably more. Plus, there is a latency due to sending the ack and request for the next record that undoubtedly swamps out any reduction in memory speed due to the single copy that takes place. In the case of other devices, like for example disks, the percentage effect is probably greater, but since these devices are so fast anyway that the human-perceived speed reduction is essentially nil. If we were talking about a CPU running Linux and doing all kinds of I/O all day long, the reduction in throughput performance might be 10% and that might matter. In a boot loader that does I/O mostly to read in a program to replace itself, I would argue that nobody will notice the difference between cached and un-cached buffers. Counter-examples welcome however. > >>> The >>> question is, which is easier to do, and that is probably a matter of >>> opinion. However, it is safe to say that so far a cached solution has >>> eluded us. That may be changing, but it would still be nice to know how >>> to allocate a section of un-cached RAM in the ARM processor, in so far >>> as the question has a single answer! That would allow easy portability >>> of drivers that do not know about caches, of which there seems to be >>> many. > > That is one approach, which I think prevents cache from being used > beyond caching pure CPU-used DRAM. You are certainly correct there. However, I think the pure CPU-used ram case is the one that matters most. Uncompressing and checksumming of input data are typical u-boot functions that take significant time. The performance increase due to cache hits in these cases is huge, and easily perceptible by the user. > >> I agree. Unfortunately, my time is up for now, and I can't go on with >> trying >> to fix this driver. Maybe I'll pick up after my vacation. >> As for now I settled for the ugly solution of keeping dcache disabled >> while >> ethernet is being used :-( > > Make sure you flush before disabling. :) > >> IMHO, doing cache maintenance all over the driver is not an easy or nice >> solution. Implementing a non-cached memory pool in the MMU and a >> corresponding >> dma_malloc() sounds like much more universally applicable to any driver. > > I think cache maintenance is feasible if one makes sure the cached > areas used by the driver are properly aligned, which simplifies things > a lot: you don't have to care for flush-invalidate or just-in-time > invalidate, you just have to flush before sending and invalidate > before reading. I do agree it can be done. However, most (I think?) of the CPUs to which u-boot have been ported have cache-coherent DMA. As a result, cache issues for these CPUs are not addressed in the driver at all. Often this means that cache support is done after the fact by somebody other than the original author who may not totally understand the original driver. If DMA buffers were always allocated from cache-coherent memory, either because the memory is un-cached or because the CPU is DMA cache coherent, no changes would be necessary to get the driver working correctly. If performance ever became an issue in the un-cached case, then more work would be required, but in most cases, I expect nobody will notice. Best Regards, Bill Campbell > >> Best regards, > > Amicalement, ^ permalink raw reply [flat|nested] 36+ messages in thread
* [U-Boot] i.MX51: FEC: Cache coherency problem? 2011-07-19 18:14 ` Anton Staaf 2011-07-19 20:11 ` J. William Campbell @ 2011-07-20 8:37 ` Aneesh V 1 sibling, 0 replies; 36+ messages in thread From: Aneesh V @ 2011-07-20 8:37 UTC (permalink / raw) To: u-boot Hi Anton, On Tuesday 19 July 2011 11:44 PM, Anton Staaf wrote: [snip ..] > There are a number of possible solutions: > > 1) Modify the invalidate code to first read the partial cache line > and then invalidate and then write back just the valid part of the > line. This suffers from a race condition with concurrent code in > interrupt handlers or other CPUs. How do you propose to implement this? Are you suggesting something like momentarily turning off D-cache, read the memory content into registers, enable cache again, mix the memory content in registers with the cache content and then flush, or something like that? I am afraid this will be way too complex if at all viable. I don't see any other easy option. Am I missing something? best regards, Aneesh ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2011-07-23 15:35 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-18 15:18 [U-Boot] i.MX51: FEC: Cache coherency problem? David Jander
2011-07-18 16:16 ` Aneesh V
2011-07-19 7:26 ` David Jander
2011-07-19 11:07 ` Matthias Weißer
2011-07-19 11:17 ` David Jander
2011-07-19 11:20 ` Wolfgang Denk
2011-07-19 12:10 ` David Jander
2011-07-20 6:29 ` David Jander
2011-07-20 8:56 ` Albert ARIBAUD
2011-07-20 9:21 ` David Jander
2011-07-20 10:29 ` Aneesh V
2011-07-20 11:31 ` David Jander
2011-07-20 12:05 ` Aneesh V
2011-07-19 11:19 ` Wolfgang Denk
2011-07-19 14:31 ` Matthias Weißer
2011-07-19 11:51 ` Aneesh V
2011-07-18 16:55 ` Stefano Babic
2011-07-19 7:44 ` David Jander
2011-07-19 8:21 ` Albert ARIBAUD
2011-07-19 8:37 ` David Jander
2011-07-19 8:43 ` Aneesh V
2011-07-19 8:58 ` David Jander
2011-07-19 9:11 ` Albert ARIBAUD
2011-07-19 11:50 ` Aneesh V
2011-07-19 11:42 ` Aneesh V
2011-07-19 9:05 ` Albert ARIBAUD
2011-07-19 14:36 ` J. William Campbell
2011-07-19 15:17 ` David Jander
2011-07-19 18:14 ` Anton Staaf
2011-07-19 20:11 ` J. William Campbell
2011-07-20 13:02 ` Albert ARIBAUD
[not found] ` <4E26DF9D.5070709@comcast.net>
[not found] ` <4E26E7AA.9070001@aribaud.net>
2011-07-20 15:36 ` J. William Campbell
2011-07-21 6:48 ` David Jander
2011-07-23 13:04 ` Albert ARIBAUD
2011-07-23 15:35 ` J. William Campbell
2011-07-20 8:37 ` Aneesh V
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox