[U-Boot] i.MX51: FEC: Cache coherency problem?

public inbox for u-boot@lists.denx.de
 help / color / mirror / Atom feed

* [U-Boot] i.MX51: FEC: Cache coherency problem?
@ 2011-07-18 15:18 David Jander
  2011-07-18 16:16 ` Aneesh V
  2011-07-18 16:55 ` Stefano Babic
  0 siblings, 2 replies; 36+ messages in thread
From: David Jander @ 2011-07-18 15:18 UTC (permalink / raw)
  To: u-boot

Hi all,

I am busy debugging a problem with the i.MX51 FEC ethernet driver, that
stopped working after upgrading u-boot. Before the upgrade I used
v2010.06-rc3, which worked fine.
I gave up on trying to find the difference beween the old version and this one
that broke it.... due to BSP issues, git bisecting seems a monumental task I
am not considering yet.
The funny part is that it seems to work fine if I disable data-caches!
With data caches enabled, it hangs in the following while loop in fec_send(),
at line 592:

...
584         /*
585          * Enable SmartDMA transmit task
586          */
587         fec_tx_task_enable(fec);
588 
589         /*
590          * wait until frame is sent .
591          */
592         while (readw(&fec->tbd_base[fec->tbd_index].status) & FEC_TBD_READY) {
593                 udelay(1);
594         }
...

If I change this code in the following way, the while loop exits successfully:

...
584         /*
585          * Enable SmartDMA transmit task
586          */
587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
588         fec_tx_task_enable(fec);
589         flash_dcache_all();
590 
591         /*
592          * wait until frame is sent .
593          */
594         while (readw(&fec->tbd_base[fec->tbd_index].status) & FEC_TBD_READY) {
595                 udelay(1);
596         }
...

Notice the cache flush calls at line 587 and 589. With these, sending
succeeds. The driver still hangs in receiving afterwards, but at least this
part seems to work. If I remove either of the two added lines, it stops
working again.

What is going on here? Why did this work with caches enabled before??

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-18 15:18 [U-Boot] i.MX51: FEC: Cache coherency problem? David Jander
@ 2011-07-18 16:16 ` Aneesh V
  2011-07-19  7:26   ` David Jander
  2011-07-19 11:07   ` Matthias Weißer
  2011-07-18 16:55 ` Stefano Babic
  1 sibling, 2 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-18 16:16 UTC (permalink / raw)
  To: u-boot

Hi David,

On Monday 18 July 2011 08:48 PM, David Jander wrote:
>
> Hi all,
>
> I am busy debugging a problem with the i.MX51 FEC ethernet driver, that
> stopped working after upgrading u-boot. Before the upgrade I used
> v2010.06-rc3, which worked fine.
> I gave up on trying to find the difference beween the old version and this one
> that broke it.... due to BSP issues, git bisecting seems a monumental task I
> am not considering yet.
> The funny part is that it seems to work fine if I disable data-caches!
> With data caches enabled, it hangs in the following while loop in fec_send(),
> at line 592:
>
> ...
> 584         /*
> 585          * Enable SmartDMA transmit task
> 586          */
> 587         fec_tx_task_enable(fec);
> 588
> 589         /*
> 590          * wait until frame is sent .
> 591          */
> 592         while (readw(&fec->tbd_base[fec->tbd_index].status)&  FEC_TBD_READY) {
> 593                 udelay(1);
> 594         }
> ...
>
> If I change this code in the following way, the while loop exits successfully:
>
> ...
> 584         /*
> 585          * Enable SmartDMA transmit task
> 586          */
> 587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
> 588         fec_tx_task_enable(fec);
> 589         flash_dcache_all();
> 590
> 591         /*
> 592          * wait until frame is sent .
> 593          */
> 594         while (readw(&fec->tbd_base[fec->tbd_index].status)&  FEC_TBD_READY) {
> 595                 udelay(1);
> 596         }
> ...
>
> Notice the cache flush calls at line 587 and 589. With these, sending
> succeeds. The driver still hangs in receiving afterwards, but at least this
> part seems to work. If I remove either of the two added lines, it stops
> working again.
>
> What is going on here? Why did this work with caches enabled before??

Are you sure caches were enabled before? Until recently absence of
CONFIG_SYS_DCACHE_OFF did not mean that your D-cache is enabled because
you also had to call the function dcache_enable(). Some platforms were
calling this from board file or some drivers but not all.

The following patch changed this for ARM:

commit c2dd0d45540397704de9b13287417d21049d34c6
armv7: integrate cache maintenance support

In this patch I added a call to dcache_enable() at the beginning of
board_init_r() for ARM(i.e. as soon as relocation is over). As a result
D-cache will be enabled for all ARM platforms now unless
CONFIG_SYS_DCACHE_OFF is set.

Looks like this is a real coherency issue that is brought out because
d-cache is really enabled for you now.

best regards,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-18 16:16 ` Aneesh V
@ 2011-07-19  7:26   ` David Jander
  2011-07-19 11:07   ` Matthias Weißer
  1 sibling, 0 replies; 36+ messages in thread
From: David Jander @ 2011-07-19  7:26 UTC (permalink / raw)
  To: u-boot

On Mon, 18 Jul 2011 21:46:28 +0530
Aneesh V <aneesh@ti.com> wrote:

> Hi David,
> 
> On Monday 18 July 2011 08:48 PM, David Jander wrote:
> >
> > Hi all,
> >
> > I am busy debugging a problem with the i.MX51 FEC ethernet driver, that
> > stopped working after upgrading u-boot. Before the upgrade I used
> > v2010.06-rc3, which worked fine.
> > I gave up on trying to find the difference beween the old version and this
> > one that broke it.... due to BSP issues, git bisecting seems a monumental
> > task I am not considering yet.
> > The funny part is that it seems to work fine if I disable data-caches!
> > With data caches enabled, it hangs in the following while loop in
> > fec_send(), at line 592:
> >
> > ...
> > 584         /*
> > 585          * Enable SmartDMA transmit task
> > 586          */
> > 587         fec_tx_task_enable(fec);
> > 588
> > 589         /*
> > 590          * wait until frame is sent .
> > 591          */
> > 592         while (readw(&fec->tbd_base[fec->tbd_index].status)&
> > FEC_TBD_READY) { 593                 udelay(1);
> > 594         }
> > ...
> >
> > If I change this code in the following way, the while loop exits
> > successfully:
> >
> > ...
> > 584         /*
> > 585          * Enable SmartDMA transmit task
> > 586          */
> > 587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
> > 588         fec_tx_task_enable(fec);
> > 589         flash_dcache_all();
> > 590
> > 591         /*
> > 592          * wait until frame is sent .
> > 593          */
> > 594         while (readw(&fec->tbd_base[fec->tbd_index].status)&
> > FEC_TBD_READY) { 595                 udelay(1);
> > 596         }
> > ...
> >
> > Notice the cache flush calls at line 587 and 589. With these, sending
> > succeeds. The driver still hangs in receiving afterwards, but at least this
> > part seems to work. If I remove either of the two added lines, it stops
> > working again.
> >
> > What is going on here? Why did this work with caches enabled before??
> 
> Are you sure caches were enabled before? Until recently absence of
> CONFIG_SYS_DCACHE_OFF did not mean that your D-cache is enabled because
> you also had to call the function dcache_enable(). Some platforms were
> calling this from board file or some drivers but not all.

We did. I am always calling dcache_enable(), icache_enable() and
l2_cache_enable() from board code. The latter function I implemented myself,
because nor u-boot nor the kernel does otherwise activate l2-cache, which has a
big performance impact.

Just in case, at the moment I am testing with only the L1 caches enabled.

> The following patch changed this for ARM:
> 
> commit c2dd0d45540397704de9b13287417d21049d34c6
> armv7: integrate cache maintenance support
> 
> In this patch I added a call to dcache_enable() at the beginning of
> board_init_r() for ARM(i.e. as soon as relocation is over). As a result
> D-cache will be enabled for all ARM platforms now unless
> CONFIG_SYS_DCACHE_OFF is set.

Yes, I noticed that.

> Looks like this is a real coherency issue that is brought out because
> d-cache is really enabled for you now.

AFAICS, it was always enabled for our board. At least I did always call
i/dcache_enable() in board setup code.... was it broken before? Are the L1
caches enabled by (mainline) linux code?

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-18 16:16 ` Aneesh V
  2011-07-19  7:26   ` David Jander
@ 2011-07-19 11:07   ` Matthias Weißer
  2011-07-19 11:17     ` David Jander
                       ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Matthias Weißer @ 2011-07-19 11:07 UTC (permalink / raw)
  To: u-boot

Dear Aneesh

Am 18.07.2011 18:16, schrieb Aneesh V:
> commit c2dd0d45540397704de9b13287417d21049d34c6
> armv7: integrate cache maintenance support
>
> In this patch I added a call to dcache_enable() at the beginning of
> board_init_r() for ARM(i.e. as soon as relocation is over). As a result
> D-cache will be enabled for all ARM platforms now unless
> CONFIG_SYS_DCACHE_OFF is set.

Is this really a good idea? This will break a couple of boards using 
non-cache-aware drivers. And there are a couple of them in u-boot. I 
think d-cache should be opt-in rather then opt-out as long as there are 
any drivers which didn't handle cached memory regions correct. i-cache 
is much less problematic and can be enabled by default.

If d-cache will be enabled by default on ARM I think I have to send a 
patch for one of my boards :-)

Regards
Matthias

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 11:07   ` Matthias Weißer
@ 2011-07-19 11:17     ` David Jander
  2011-07-19 11:20       ` Wolfgang Denk
  2011-07-19 11:19     ` Wolfgang Denk
  2011-07-19 11:51     ` Aneesh V
  2 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-19 11:17 UTC (permalink / raw)
  To: u-boot

On Tue, 19 Jul 2011 13:07:52 +0200
Matthias Wei?er <weisserm@arcor.de> wrote:

> Dear Aneesh
> 
> Am 18.07.2011 18:16, schrieb Aneesh V:
> > commit c2dd0d45540397704de9b13287417d21049d34c6
> > armv7: integrate cache maintenance support
> >
> > In this patch I added a call to dcache_enable() at the beginning of
> > board_init_r() for ARM(i.e. as soon as relocation is over). As a result
> > D-cache will be enabled for all ARM platforms now unless
> > CONFIG_SYS_DCACHE_OFF is set.
> 
> Is this really a good idea? This will break a couple of boards using 
> non-cache-aware drivers. And there are a couple of them in u-boot. I 
> think d-cache should be opt-in rather then opt-out as long as there are 
> any drivers which didn't handle cached memory regions correct. i-cache 
> is much less problematic and can be enabled by default.
> 
> If d-cache will be enabled by default on ARM I think I have to send a 
> patch for one of my boards :-)

Exactly the issue I am having now... the fact that I was unaware of this patch
and generally don't have much experience with solving cache-related issues
apparently, made me waste a lot of time debugging the fec_mxc.c driver while
the driver itself wasn't really broken (at least not more than before).
Now I finally know what's wrong and am working on a proposed fix to make this
one driver cache-aware.

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 11:17     ` David Jander
@ 2011-07-19 11:20       ` Wolfgang Denk
  2011-07-19 12:10         ` David Jander
  0 siblings, 1 reply; 36+ messages in thread
From: Wolfgang Denk @ 2011-07-19 11:20 UTC (permalink / raw)
  To: u-boot

Dear David Jander,

In message <20110719131744.403a81e6@archvile> you wrote:
> 
> Now I finally know what's wrong and am working on a proposed fix to make this
> one driver cache-aware.

I would just like to point out that these efforts are highly
appreciated!

Thanks!


Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
Chapter 1 -- The story so  far:
In the beginning the Universe was created. This has  made  a  lot  of
people very angry and been widely regarded as a bad move.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 11:20       ` Wolfgang Denk
@ 2011-07-19 12:10         ` David Jander
  2011-07-20  6:29           ` David Jander
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-19 12:10 UTC (permalink / raw)
  To: u-boot

On Tue, 19 Jul 2011 13:20:26 +0200
Wolfgang Denk <wd@denx.de> wrote:

> Dear David Jander,
> 
> In message <20110719131744.403a81e6@archvile> you wrote:
> > 
> > Now I finally know what's wrong and am working on a proposed fix to make
> > this one driver cache-aware.
> 
> I would just like to point out that these efforts are highly
> appreciated!

Thanks.
I'll try my best, but I am running out of time, and it is still not as it
should. I still have trouble identifying the right place where receive buffer
descriptors should be cache-invalidated. At one point, a magic
flush_dcache_all() at a certain place in code made it work, but that can't be
entirely correct, and if I replace it with only the necessary ranges, it
doesn't work correctly anymore, so I guess this flush is also invalidating
something I need elsewhere :-(
I keep searching.... through the IMO fairly convoluted network code.

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 12:10         ` David Jander
@ 2011-07-20  6:29           ` David Jander
  2011-07-20  8:56             ` Albert ARIBAUD
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-20  6:29 UTC (permalink / raw)
  To: u-boot

On Tue, 19 Jul 2011 14:10:48 +0200
David Jander <david.jander@protonic.nl> wrote:

> On Tue, 19 Jul 2011 13:20:26 +0200
> Wolfgang Denk <wd@denx.de> wrote:
> 
> > Dear David Jander,
> > 
> > In message <20110719131744.403a81e6@archvile> you wrote:
> > > 
> > > Now I finally know what's wrong and am working on a proposed fix to make
> > > this one driver cache-aware.
> > 
> > I would just like to point out that these efforts are highly
> > appreciated!
> 
> Thanks.
> I'll try my best, but I am running out of time, and it is still not as it
> should. I still have trouble identifying the right place where receive buffer
> descriptors should be cache-invalidated. At one point, a magic
> flush_dcache_all() at a certain place in code made it work, but that can't be
> entirely correct, and if I replace it with only the necessary ranges, it
> doesn't work correctly anymore, so I guess this flush is also invalidating
> something I need elsewhere :-(
> I keep searching.... through the IMO fairly convoluted network code.

Ok, that was not such a problem. I found few nice spots, where
cache-maintenance code should go IMO:

In fec_rx_task_enable() cleaned RX BD's should be flushed.
In fec_tx_task_enable() dirty TX BD's should be flushed, as well as data
buffers.
In fec_send() Invalidate BD status fields before reading, also in the while
loop.
In fec_recv() Invalidate BD and data buffers just before every read.

And of course align all buffers to cache-line size and pad around the end of
each buffer to cache-line size.

Problem: It doesn't work (tftp transfer stops after a few packets)! :-(
I am giving up!

Plan B: I'll now try to allocate all buffers in internal RAM (which is not
cached AFAIK).

Any ideas?

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-20  6:29           ` David Jander
@ 2011-07-20  8:56             ` Albert ARIBAUD
  2011-07-20  9:21               ` David Jander
  0 siblings, 1 reply; 36+ messages in thread
From: Albert ARIBAUD @ 2011-07-20  8:56 UTC (permalink / raw)
  To: u-boot

Hi David,

Le 20/07/2011 08:29, David Jander a ?crit :
> On Tue, 19 Jul 2011 14:10:48 +0200
> David Jander<david.jander@protonic.nl>  wrote:
>
>> On Tue, 19 Jul 2011 13:20:26 +0200
>> Wolfgang Denk<wd@denx.de>  wrote:
>>
>>> Dear David Jander,
>>>
>>> In message<20110719131744.403a81e6@archvile>  you wrote:
>>>>
>>>> Now I finally know what's wrong and am working on a proposed fix to make
>>>> this one driver cache-aware.
>>>
>>> I would just like to point out that these efforts are highly
>>> appreciated!
>>
>> Thanks.
>> I'll try my best, but I am running out of time, and it is still not as it
>> should. I still have trouble identifying the right place where receive buffer
>> descriptors should be cache-invalidated. At one point, a magic
>> flush_dcache_all() at a certain place in code made it work, but that can't be
>> entirely correct, and if I replace it with only the necessary ranges, it
>> doesn't work correctly anymore, so I guess this flush is also invalidating
>> something I need elsewhere :-(
>> I keep searching.... through the IMO fairly convoluted network code.
>
> Ok, that was not such a problem. I found few nice spots, where
> cache-maintenance code should go IMO:
>
> In fec_rx_task_enable() cleaned RX BD's should be flushed.
> In fec_tx_task_enable() dirty TX BD's should be flushed, as well as data
> buffers.
> In fec_send() Invalidate BD status fields before reading, also in the while
> loop.
> In fec_recv() Invalidate BD and data buffers just before every read.
>
> And of course align all buffers to cache-line size and pad around the end of
> each buffer to cache-line size.
>
> Problem: It doesn't work (tftp transfer stops after a few packets)! :-(
> I am giving up!
>
> Plan B: I'll now try to allocate all buffers in internal RAM (which is not
> cached AFAIK).
>
> Any ideas?

Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA 
as well, not because of cache (it was not active at the time) but 
because of instruction reordering done by the compiler when optimizing, 
which resulted in out-of-order accesses to the descritpors and DMA 
registered, so that the DMA start was written before the descriptors 
were set up. The (proper and clean) solution was to introduce memory 
barriers at strategic points of the driver to ensure that specific DMA 
starts were done only after all writes to the descriptors (and possibly 
buffers).

See drivers/net/mvgbe.c, and llok for 'isb()' instructions.

> Best regards,

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-20  8:56             ` Albert ARIBAUD
@ 2011-07-20  9:21               ` David Jander
  2011-07-20 10:29                 ` Aneesh V
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-20  9:21 UTC (permalink / raw)
  To: u-boot

On Wed, 20 Jul 2011 10:56:07 +0200
Albert ARIBAUD <albert.u.boot@aribaud.net> wrote:

> Hi David,
> 
> Le 20/07/2011 08:29, David Jander a ?crit :
> > On Tue, 19 Jul 2011 14:10:48 +0200
> > David Jander<david.jander@protonic.nl>  wrote:
> >
> >> On Tue, 19 Jul 2011 13:20:26 +0200
> >> Wolfgang Denk<wd@denx.de>  wrote:
> >>
> >>> Dear David Jander,
> >>>
> >>> In message<20110719131744.403a81e6@archvile>  you wrote:
> >>>>
> >>>> Now I finally know what's wrong and am working on a proposed fix to make
> >>>> this one driver cache-aware.
> >>>
> >>> I would just like to point out that these efforts are highly
> >>> appreciated!
> >>
> >> Thanks.
> >> I'll try my best, but I am running out of time, and it is still not as it
> >> should. I still have trouble identifying the right place where receive
> >> buffer descriptors should be cache-invalidated. At one point, a magic
> >> flush_dcache_all() at a certain place in code made it work, but that
> >> can't be entirely correct, and if I replace it with only the necessary
> >> ranges, it doesn't work correctly anymore, so I guess this flush is also
> >> invalidating something I need elsewhere :-(
> >> I keep searching.... through the IMO fairly convoluted network code.
> >
> > Ok, that was not such a problem. I found few nice spots, where
> > cache-maintenance code should go IMO:
> >
> > In fec_rx_task_enable() cleaned RX BD's should be flushed.
> > In fec_tx_task_enable() dirty TX BD's should be flushed, as well as data
> > buffers.
> > In fec_send() Invalidate BD status fields before reading, also in the while
> > loop.
> > In fec_recv() Invalidate BD and data buffers just before every read.
> >
> > And of course align all buffers to cache-line size and pad around the end
> > of each buffer to cache-line size.
> >
> > Problem: It doesn't work (tftp transfer stops after a few packets)! :-(
> > I am giving up!
> >
> > Plan B: I'll now try to allocate all buffers in internal RAM (which is not
> > cached AFAIK).
> >
> > Any ideas?
> 
> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA 
> as well, not because of cache (it was not active at the time) but 
> because of instruction reordering done by the compiler when optimizing, 
> which resulted in out-of-order accesses to the descritpors and DMA 
> registered, so that the DMA start was written before the descriptors 
> were set up. The (proper and clean) solution was to introduce memory 
> barriers at strategic points of the driver to ensure that specific DMA 
> starts were done only after all writes to the descriptors (and possibly 
> buffers).

Hmmm. I know that issue, but AFAICS, the driver already uses readX() and
writeX() macros when accessing register and DMA memory. Those macros have
compiler barriers included.... and armv7 is still in-order execution ;-)
Thanks for the suggestion anyway :-)

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-20  9:21               ` David Jander
@ 2011-07-20 10:29                 ` Aneesh V
  2011-07-20 11:31                   ` David Jander
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh V @ 2011-07-20 10:29 UTC (permalink / raw)
  To: u-boot

Hi David,

On Wednesday 20 July 2011 02:51 PM, David Jander wrote:
> On Wed, 20 Jul 2011 10:56:07 +0200

[snip ..]

>>> Any ideas?
>>
>> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA
>> as well, not because of cache (it was not active at the time) but
>> because of instruction reordering done by the compiler when optimizing,
>> which resulted in out-of-order accesses to the descritpors and DMA
>> registered, so that the DMA start was written before the descriptors
>> were set up. The (proper and clean) solution was to introduce memory
>> barriers at strategic points of the driver to ensure that specific DMA
>> starts were done only after all writes to the descriptors (and possibly
>> buffers).
>
> Hmmm. I know that issue, but AFAICS, the driver already uses readX() and
> writeX() macros when accessing register and DMA memory. Those macros have
> compiler barriers included.... and armv7 is still in-order execution ;-)

1. armv7 is not necessarily in-order. Cortex-A8 is in-order while
Cortex-A9 is out-of-order.
2. I am not sure if in-order execution guarantees memory ordering. My
understanding is that it doesn't.
3. In u-boot's implementation of MMU for ARM, the register
space(everything other than SDRAM) is 'strongly-ordered' memory.
Strongly ordered accesses will be correctly ordered w.r.to all other
accesses in program order.
4. Compiler barriers prevent the compiler from doing certain
optimizations, but doesn't help in these situations.

In short, I think you don't need to worry about ordering, but not
because of compiler barriers or armv7 being in-order.

br,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-20 10:29                 ` Aneesh V
@ 2011-07-20 11:31                   ` David Jander
  2011-07-20 12:05                     ` Aneesh V
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-20 11:31 UTC (permalink / raw)
  To: u-boot


Hi Aneesh,

On Wed, 20 Jul 2011 15:59:42 +0530
Aneesh V <aneesh@ti.com> wrote:
> On Wednesday 20 July 2011 02:51 PM, David Jander wrote:
> > On Wed, 20 Jul 2011 10:56:07 +0200
> 
> [snip ..]
> 
> >>> Any ideas?
> >>
> >> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA
> >> as well, not because of cache (it was not active at the time) but
> >> because of instruction reordering done by the compiler when optimizing,
> >> which resulted in out-of-order accesses to the descritpors and DMA
> >> registered, so that the DMA start was written before the descriptors
> >> were set up. The (proper and clean) solution was to introduce memory
> >> barriers at strategic points of the driver to ensure that specific DMA
> >> starts were done only after all writes to the descriptors (and possibly
> >> buffers).
> >
> > Hmmm. I know that issue, but AFAICS, the driver already uses readX() and
> > writeX() macros when accessing register and DMA memory. Those macros have
> > compiler barriers included.... and armv7 is still in-order execution ;-)
> 
> 1. armv7 is not necessarily in-order. Cortex-A8 is in-order while
> Cortex-A9 is out-of-order.
> 2. I am not sure if in-order execution guarantees memory ordering. My
> understanding is that it doesn't.
> 3. In u-boot's implementation of MMU for ARM, the register
> space(everything other than SDRAM) is 'strongly-ordered' memory.
> Strongly ordered accesses will be correctly ordered w.r.to all other
> accesses in program order.

Wow, I hadn't noticed dcache_enable() also enables the MMU. Cool. One step
closer to splitting the mapping between a cached and uncached, strongly-ordered
region of RAM.

> 4. Compiler barriers prevent the compiler from doing certain
> optimizations, but doesn't help in these situations.

Thanks for this excellent summary!

> In short, I think you don't need to worry about ordering, but not
> because of compiler barriers or armv7 being in-order.

Well, I do need to worry about the buffer-descriptors and buffers then, since
they are in SDRAM, right?

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-20 11:31                   ` David Jander
@ 2011-07-20 12:05                     ` Aneesh V
  0 siblings, 0 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-20 12:05 UTC (permalink / raw)
  To: u-boot

Hi David,

On Wednesday 20 July 2011 05:01 PM, David Jander wrote:
>
> Hi Aneesh,
>
> On Wed, 20 Jul 2011 15:59:42 +0530
> Aneesh V<aneesh@ti.com>  wrote:
>> On Wednesday 20 July 2011 02:51 PM, David Jander wrote:
>>> On Wed, 20 Jul 2011 10:56:07 +0200
>>
>> [snip ..]
>>
>>>>> Any ideas?
>>>>
>>>> Yes, one: I had issues with the Marvell Ethernet adapter, which has DMA
>>>> as well, not because of cache (it was not active at the time) but
>>>> because of instruction reordering done by the compiler when optimizing,
>>>> which resulted in out-of-order accesses to the descritpors and DMA
>>>> registered, so that the DMA start was written before the descriptors
>>>> were set up. The (proper and clean) solution was to introduce memory
>>>> barriers at strategic points of the driver to ensure that specific DMA
>>>> starts were done only after all writes to the descriptors (and possibly
>>>> buffers).
>>>
>>> Hmmm. I know that issue, but AFAICS, the driver already uses readX() and
>>> writeX() macros when accessing register and DMA memory. Those macros have
>>> compiler barriers included.... and armv7 is still in-order execution ;-)
>>
>> 1. armv7 is not necessarily in-order. Cortex-A8 is in-order while
>> Cortex-A9 is out-of-order.
>> 2. I am not sure if in-order execution guarantees memory ordering. My
>> understanding is that it doesn't.
>> 3. In u-boot's implementation of MMU for ARM, the register
>> space(everything other than SDRAM) is 'strongly-ordered' memory.
>> Strongly ordered accesses will be correctly ordered w.r.to all other
>> accesses in program order.
>
> Wow, I hadn't noticed dcache_enable() also enables the MMU. Cool. One step
> closer to splitting the mapping between a cached and uncached, strongly-ordered
> region of RAM.

Yes. MMU is enabled with an identity mapping. Please note that ARMv6
onwards D-cache can not be enabled when MMU is disabled. I-cache works
without MMU enabled.

>
>> 4. Compiler barriers prevent the compiler from doing certain
>> optimizations, but doesn't help in these situations.
>
> Thanks for this excellent summary!
>
>> In short, I think you don't need to worry about ordering, but not
>> because of compiler barriers or armv7 being in-order.
>
> Well, I do need to worry about the buffer-descriptors and buffers then, since
> they are in SDRAM, right?

Yes, you need to worry about buffers and descriptors. Here are the
rules-of-thumb to take care of them:

1. Flush before initiating memory to peripheral DMA(this applies to
descriptors too)
2. Invalidate after peripheral to memory DMA (before the CPU reads
the buffer)
3. Make sure that any buffer that you have to invalidate is cache-line
aligned at the beginning and at the end.

br,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 11:07   ` Matthias Weißer
  2011-07-19 11:17     ` David Jander
@ 2011-07-19 11:19     ` Wolfgang Denk
  2011-07-19 14:31       ` Matthias Weißer
  2011-07-19 11:51     ` Aneesh V
  2 siblings, 1 reply; 36+ messages in thread
From: Wolfgang Denk @ 2011-07-19 11:19 UTC (permalink / raw)
  To: u-boot

Dear =?ISO-8859-1?Q?Matthias_Wei=DFer?=,

In message <4E256588.4010301@arcor.de> you wrote:
> 
> Is this really a good idea? This will break a couple of boards using 
> non-cache-aware drivers. And there are a couple of them in u-boot. I 
> think d-cache should be opt-in rather then opt-out as long as there are 
> any drivers which didn't handle cached memory regions correct. i-cache 
> is much less problematic and can be enabled by default.

If we do this, then everybody will just be lazy, and nothing will ever
change.

The lesson we learned (a number of times) is that people only start
moving when it really hurts, i. e. when things really break for them.
I don't like that, but if it's the only way to make any progress then
so be it.

> If d-cache will be enabled by default on ARM I think I have to send a 
> patch for one of my boards :-)

Why don't you just help identifying and fixing the problems in the
misbehaving drivers?!?

That would be a _much_ more helpful approach.

Thanks.


Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
"I've finally learned what `upward compatible' means. It means we get
to keep all our old mistakes." - Dennie van Tassel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 11:19     ` Wolfgang Denk
@ 2011-07-19 14:31       ` Matthias Weißer
  0 siblings, 0 replies; 36+ messages in thread
From: Matthias Weißer @ 2011-07-19 14:31 UTC (permalink / raw)
  To: u-boot

Am 19.07.2011 13:19, schrieb Wolfgang Denk:
> Dear =?ISO-8859-1?Q?Matthias_Wei=DFer?=,
>
> In message<4E256588.4010301@arcor.de>  you wrote:
>>
>> Is this really a good idea? This will break a couple of boards using
>> non-cache-aware drivers. And there are a couple of them in u-boot. I
>> think d-cache should be opt-in rather then opt-out as long as there are
>> any drivers which didn't handle cached memory regions correct. i-cache
>> is much less problematic and can be enabled by default.
>
> If we do this, then everybody will just be lazy, and nothing will ever
> change.

You are right. But time is a limited resource. And fixing cache related 
issues is a bit more complex then, for example, the simple changes 
needed when relocation was introduced in ARM.

>> If d-cache will be enabled by default on ARM I think I have to send a
>> patch for one of my boards :-)
>
> Why don't you just help identifying and fixing the problems in the
> misbehaving drivers?!?
>
> That would be a _much_ more helpful approach.

Well, that is what I meant. I have to fix the driver then. But this will 
interfere with the fact that time is a lim.... You know ;-)

If both of my boards are in mainline I will take a look at them and try 
to fix problematic drivers (mainly USB OHCI should be the problem)

Regards
Matthias

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 11:07   ` Matthias Weißer
  2011-07-19 11:17     ` David Jander
  2011-07-19 11:19     ` Wolfgang Denk
@ 2011-07-19 11:51     ` Aneesh V
  2 siblings, 0 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-19 11:51 UTC (permalink / raw)
  To: u-boot

Dear Matthias,

On Tuesday 19 July 2011 04:37 PM, Matthias Wei?er wrote:
> Dear Aneesh
>
> Am 18.07.2011 18:16, schrieb Aneesh V:
>> commit c2dd0d45540397704de9b13287417d21049d34c6
>> armv7: integrate cache maintenance support
>>
>> In this patch I added a call to dcache_enable() at the beginning of
>> board_init_r() for ARM(i.e. as soon as relocation is over). As a result
>> D-cache will be enabled for all ARM platforms now unless
>> CONFIG_SYS_DCACHE_OFF is set.
>
> Is this really a good idea? This will break a couple of boards using
> non-cache-aware drivers. And there are a couple of them in u-boot. I
> think d-cache should be opt-in rather then opt-out as long as there are
> any drivers which didn't handle cached memory regions correct. i-cache
> is much less problematic and can be enabled by default.

D-cache was always an opt-out in u-boot based on the fact that the flag
was CONFIG_SYS_DCACHE_OFF and not something like
CONFIG_SYS_DCACHE_ENABLE. I just made the behavior more consistent and
predictable. Also, my idea was to enable it(if allowed by the config
flag) at the earliest possible time rather than waiting for a random
driver or board file to enable it much later.

Regarding the question of whether it should be opt-out or opt-in I have
no preference myself.

best regards,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-18 15:18 [U-Boot] i.MX51: FEC: Cache coherency problem? David Jander
  2011-07-18 16:16 ` Aneesh V
@ 2011-07-18 16:55 ` Stefano Babic
  2011-07-19  7:44   ` David Jander
  1 sibling, 1 reply; 36+ messages in thread
From: Stefano Babic @ 2011-07-18 16:55 UTC (permalink / raw)
  To: u-boot

On 07/18/2011 05:18 PM, David Jander wrote:
> 
> Hi all,

Hi David,

> What is going on here? Why did this work with caches enabled before??

I think cache was always disabled..

Best regards,
Stefano Babic

-- 
=====================================================================
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-0 Fax: +49-8142-66989-80  Email: office at denx.de
=====================================================================

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-18 16:55 ` Stefano Babic
@ 2011-07-19  7:44   ` David Jander
  2011-07-19  8:21     ` Albert ARIBAUD
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-19  7:44 UTC (permalink / raw)
  To: u-boot

Hi Stefano,

On Mon, 18 Jul 2011 18:55:05 +0200
Stefano Babic <sbabic@denx.de> wrote:

> On 07/18/2011 05:18 PM, David Jander wrote:
> > 
> > Hi all,
> 
> Hi David,
> 
> > What is going on here? Why did this work with caches enabled before??
> 
> I think cache was always disabled..

I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP
cache.S), and called i/dcache_enable() from board code like this:

int board_late_init(void)
{
        power_init();
        probe_board_type();
        icache_enable();
        dcache_enable();

        return 0;
}

Is there a reason this wouldn't have worked before?

Suppose it didn't. Does that mean we need to use the MMU to properly mark
regions of register space and specially FEC BD's as not-cached? Or do we need
to flash caches manually each time such a memory region is accessed?
I am kind of a CPU-speed-junkie, so I am not sure I want to live without
caches enabled in u-boot ;-)

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  7:44   ` David Jander
@ 2011-07-19  8:21     ` Albert ARIBAUD
  2011-07-19  8:37       ` David Jander
  0 siblings, 1 reply; 36+ messages in thread
From: Albert ARIBAUD @ 2011-07-19  8:21 UTC (permalink / raw)
  To: u-boot

Hi David,

Le 19/07/2011 09:44, David Jander a ?crit :
>
> Hi Stefano,
>
> On Mon, 18 Jul 2011 18:55:05 +0200
> Stefano Babic<sbabic@denx.de>  wrote:
>
>> On 07/18/2011 05:18 PM, David Jander wrote:
>>>
>>> Hi all,
>>
>> Hi David,
>>
>>> What is going on here? Why did this work with caches enabled before??
>>
>> I think cache was always disabled..
>
> I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP
> cache.S), and called i/dcache_enable() from board code like this:
>
> int board_late_init(void)
> {
>          power_init();
>          probe_board_type();
>          icache_enable();
>          dcache_enable();
>
>          return 0;
> }
>
> Is there a reason this wouldn't have worked before?
>
> Suppose it didn't. Does that mean we need to use the MMU to properly mark
> regions of register space and specially FEC BD's as not-cached? Or do we need
> to flash caches manually each time such a memory region is accessed?
> I am kind of a CPU-speed-junkie, so I am not sure I want to live without
> caches enabled in u-boot ;-)

You would have to flush (before sending packets / starting external 
memory-to-device DMA) and invalidate (before reading received packets / 
after external device-to-memory DMA is done); using MMU and mapping 
cached/non-cached areas is IMO overkill, and will hurt CPU accesses to 
the xmit/receive buffers and descriptors.

> Best regards,

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  8:21     ` Albert ARIBAUD
@ 2011-07-19  8:37       ` David Jander
  2011-07-19  8:43         ` Aneesh V
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-19  8:37 UTC (permalink / raw)
  To: u-boot

On Tue, 19 Jul 2011 10:21:12 +0200
Albert ARIBAUD <albert.u.boot@aribaud.net> wrote:

> Hi David,
> 
> Le 19/07/2011 09:44, David Jander a ?crit :
> >
> > Hi Stefano,
> >
> > On Mon, 18 Jul 2011 18:55:05 +0200
> > Stefano Babic<sbabic@denx.de>  wrote:
> >
> >> On 07/18/2011 05:18 PM, David Jander wrote:
> >>>
> >>> Hi all,
> >>
> >> Hi David,
> >>
> >>> What is going on here? Why did this work with caches enabled before??
> >>
> >> I think cache was always disabled..
> >
> > I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP
> > cache.S), and called i/dcache_enable() from board code like this:
> >
> > int board_late_init(void)
> > {
> >          power_init();
> >          probe_board_type();
> >          icache_enable();
> >          dcache_enable();
> >
> >          return 0;
> > }
> >
> > Is there a reason this wouldn't have worked before?
> >
> > Suppose it didn't. Does that mean we need to use the MMU to properly mark
> > regions of register space and specially FEC BD's as not-cached? Or do we
> > need to flash caches manually each time such a memory region is accessed?
> > I am kind of a CPU-speed-junkie, so I am not sure I want to live without
> > caches enabled in u-boot ;-)
> 
> You would have to flush (before sending packets / starting external 
> memory-to-device DMA) and invalidate (before reading received packets / 
> after external device-to-memory DMA is done); using MMU and mapping 
> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to 
> the xmit/receive buffers and descriptors.

So, you say actually what I did while exploring the problem would have been a
correct way of solving this problem?

Like this:

587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
588         fec_tx_task_enable(fec);
589         flush_dcache_all();
590 
591         /*
592          * wait until frame is sent .
593          */
594         while (readw(&fec->tbd_base[fec->tbd_index].status) & FEC_TBD_READY) {
595                 udelay(1);
596         }

I am still not sure why I need line 587 above.
Would a patch to fec_mxc.c that does the necessary cache handwork be
acceptable for u-boot?

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  8:37       ` David Jander
@ 2011-07-19  8:43         ` Aneesh V
  2011-07-19  8:58           ` David Jander
  2011-07-19  9:05           ` Albert ARIBAUD
  0 siblings, 2 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-19  8:43 UTC (permalink / raw)
  To: u-boot



On Tuesday 19 July 2011 02:07 PM, David Jander wrote:
> On Tue, 19 Jul 2011 10:21:12 +0200
> Albert ARIBAUD<albert.u.boot@aribaud.net>  wrote:
>
>> Hi David,
>>
>> Le 19/07/2011 09:44, David Jander a ?crit :
>>>
>>> Hi Stefano,
>>>
>>> On Mon, 18 Jul 2011 18:55:05 +0200
>>> Stefano Babic<sbabic@denx.de>   wrote:
>>>
>>>> On 07/18/2011 05:18 PM, David Jander wrote:
>>>>>
>>>>> Hi all,
>>>>
>>>> Hi David,
>>>>
>>>>> What is going on here? Why did this work with caches enabled before??
>>>>
>>>> I think cache was always disabled..
>>>
>>> I had even L2-caches enabled in u-boot (copied/adapted some code from OMAP
>>> cache.S), and called i/dcache_enable() from board code like this:
>>>
>>> int board_late_init(void)
>>> {
>>>           power_init();
>>>           probe_board_type();
>>>           icache_enable();
>>>           dcache_enable();
>>>
>>>           return 0;
>>> }
>>>
>>> Is there a reason this wouldn't have worked before?
>>>
>>> Suppose it didn't. Does that mean we need to use the MMU to properly mark
>>> regions of register space and specially FEC BD's as not-cached? Or do we
>>> need to flash caches manually each time such a memory region is accessed?
>>> I am kind of a CPU-speed-junkie, so I am not sure I want to live without
>>> caches enabled in u-boot ;-)

Are you talking about some memory-mapped IO region(register space). If
that is the case, that region won't be cached. ARM mmu implementation
makes only the SDRAM region cached. Rest is non-cached non-buffered.

>>
>> You would have to flush (before sending packets / starting external
>> memory-to-device DMA) and invalidate (before reading received packets /
>> after external device-to-memory DMA is done); using MMU and mapping
>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>> the xmit/receive buffers and descriptors.
>
> So, you say actually what I did while exploring the problem would have been a
> correct way of solving this problem?
>
> Like this:
>
> 587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);

This is what is needed assuming the below is initiating a memory to
peripheral DMA. Is your buffer only 4 bytes long?

Also, please check if flush_cache() is correctly supported for your
CPU. The default implementation in in arch/arm/lib/cache.c has support
for only a handful of cpus. AFAIK, only armv7 is over-riding this
default implementation at the moment.

The fact that it's helping you indicates that it may be working for
you. But still worth a check.

> 588         fec_tx_task_enable(fec);
> 589         flush_dcache_all();

This should not be needed.

> 590
> 591         /*
> 592          * wait until frame is sent .
> 593          */
> 594         while (readw(&fec->tbd_base[fec->tbd_index].status)&  FEC_TBD_READY) {
> 595                 udelay(1);
> 596         }
>
> I am still not sure why I need line 587 above.

Did you try keeping 587 and removing 589?

best regards,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  8:43         ` Aneesh V
@ 2011-07-19  8:58           ` David Jander
  2011-07-19  9:11             ` Albert ARIBAUD
  2011-07-19 11:42             ` Aneesh V
  2011-07-19  9:05           ` Albert ARIBAUD
  1 sibling, 2 replies; 36+ messages in thread
From: David Jander @ 2011-07-19  8:58 UTC (permalink / raw)
  To: u-boot


Dear Aneesh,

Thanks a lot for your replies.

On Tue, 19 Jul 2011 14:13:34 +0530
Aneesh V <aneesh@ti.com> wrote:
> On Tuesday 19 July 2011 02:07 PM, David Jander wrote:
> > On Tue, 19 Jul 2011 10:21:12 +0200
> > Albert ARIBAUD<albert.u.boot@aribaud.net>  wrote:
> >
> >> Hi David,
> >>
> >> Le 19/07/2011 09:44, David Jander a ?crit :
> >>>
> >>> Hi Stefano,
> >>>
> >>> On Mon, 18 Jul 2011 18:55:05 +0200
> >>> Stefano Babic<sbabic@denx.de>   wrote:
> >>>
> >>>> On 07/18/2011 05:18 PM, David Jander wrote:
> >>>>>
> >>>>> Hi all,
> >>>>
> >>>> Hi David,
> >>>>
> >>>>> What is going on here? Why did this work with caches enabled before??
> >>>>
> >>>> I think cache was always disabled..
> >>>
> >>> I had even L2-caches enabled in u-boot (copied/adapted some code from
> >>> OMAP cache.S), and called i/dcache_enable() from board code like this:
> >>>
> >>> int board_late_init(void)
> >>> {
> >>>           power_init();
> >>>           probe_board_type();
> >>>           icache_enable();
> >>>           dcache_enable();
> >>>
> >>>           return 0;
> >>> }
> >>>
> >>> Is there a reason this wouldn't have worked before?
> >>>
> >>> Suppose it didn't. Does that mean we need to use the MMU to properly mark
> >>> regions of register space and specially FEC BD's as not-cached? Or do we
> >>> need to flash caches manually each time such a memory region is accessed?
> >>> I am kind of a CPU-speed-junkie, so I am not sure I want to live without
> >>> caches enabled in u-boot ;-)
> 
> Are you talking about some memory-mapped IO region(register space). If
> that is the case, that region won't be cached. ARM mmu implementation
> makes only the SDRAM region cached. Rest is non-cached non-buffered.

Ah, thanks for pointing out. I already suspected something like that...

> >> You would have to flush (before sending packets / starting external
> >> memory-to-device DMA) and invalidate (before reading received packets /
> >> after external device-to-memory DMA is done); using MMU and mapping
> >> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
> >> the xmit/receive buffers and descriptors.
> >
> > So, you say actually what I did while exploring the problem would have
> > been a correct way of solving this problem?
> >
> > Like this:
> >
> > 587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
> 
> This is what is needed assuming the below is initiating a memory to
> peripheral DMA. Is your buffer only 4 bytes long?

No it isn't. I know, I should flush the whole buffer area, but this was just
enough to get the status field flushed, so the FEC started transmitting, and
the while loop ended eventually. The result was still not correct, but at
least it won't hang.
What would be more expensive, flushing just the buffer area, or
flush_dcache_all()?

> Also, please check if flush_cache() is correctly supported for your
> CPU. The default implementation in in arch/arm/lib/cache.c has support
> for only a handful of cpus. AFAIK, only armv7 is over-riding this
> default implementation at the moment.

There is cache_v7.c which implements these... they are supposed to work
correctly I guess?

> The fact that it's helping you indicates that it may be working for
> you. But still worth a check.
> 
> > 588         fec_tx_task_enable(fec);
> > 589         flush_dcache_all();
> 
> This should not be needed.

I agree, but without it the while loop below still gets stuck.

> > 590
> > 591         /*
> > 592          * wait until frame is sent .
> > 593          */
> > 594         while (readw(&fec->tbd_base[fec->tbd_index].status)&
> > FEC_TBD_READY) {
> > 595                 udelay(1);
> > 596         }
> >
> > I am still not sure why I need line 587 above.
> 
> Did you try keeping 587 and removing 589?

Yes, I did. These two lines really is the minimum necessary to exit the while
loop. It won't work if I leave out either of those.
Now that I know a little more, I guess this is because of the status field in
the buffer-descriptor being checked in the while loop, and that is still in
cache. So the only thing line 589 does, is invalidate the caches, so the next
readw() returns the value stored by the FEC, which is apparently faster than
this piece of code :-)

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  8:58           ` David Jander
@ 2011-07-19  9:11             ` Albert ARIBAUD
  2011-07-19 11:50               ` Aneesh V
  2011-07-19 11:42             ` Aneesh V
  1 sibling, 1 reply; 36+ messages in thread
From: Albert ARIBAUD @ 2011-07-19  9:11 UTC (permalink / raw)
  To: u-boot

Hi David,

Le 19/07/2011 10:58, David Jander a ?crit :

>>> 587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>>
>> This is what is needed assuming the below is initiating a memory to
>> peripheral DMA. Is your buffer only 4 bytes long?
>
> No it isn't. I know, I should flush the whole buffer area, but this was just
> enough to get the status field flushed, so the FEC started transmitting, and
> the while loop ended eventually. The result was still not correct, but at
> least it won't hang.

Seems like you're flushing while the DMA engine is running... That's 
calling for race conditions IMO. You should only have the DMA engine 
running while you are sending something or while you are expecting to 
receive something (U-Boot differs from an OS in that devices are not 
kept running unless actually needed).

> What would be more expensive, flushing just the buffer area, or
> flush_dcache_all()?

You should call the API for flushing the area only. Obviously, you'll 
flush either just as much, or possibly more (thus take longer) if you 
flush the whole data cache.

> Yes, I did. These two lines really is the minimum necessary to exit the while
> loop. It won't work if I leave out either of those.
> Now that I know a little more, I guess this is because of the status field in
> the buffer-descriptor being checked in the while loop, and that is still in
> cache. So the only thing line 589 does, is invalidate the caches, so the next
> readw() returns the value stored by the FEC, which is apparently faster than
> this piece of code :-)

One more reason not to let it run its DMA until you're ready. :)

> Best regards,

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  9:11             ` Albert ARIBAUD
@ 2011-07-19 11:50               ` Aneesh V
  0 siblings, 0 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-19 11:50 UTC (permalink / raw)
  To: u-boot

Hi Albert, David,

On Tuesday 19 July 2011 02:41 PM, Albert ARIBAUD wrote:
> Hi David,
>
> Le 19/07/2011 10:58, David Jander a ?crit :
>
>>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>>>
>>> This is what is needed assuming the below is initiating a memory to
>>> peripheral DMA. Is your buffer only 4 bytes long?
>>
>> No it isn't. I know, I should flush the whole buffer area, but this
>> was just
>> enough to get the status field flushed, so the FEC started
>> transmitting, and
>> the while loop ended eventually. The result was still not correct, but at
>> least it won't hang.
>
> Seems like you're flushing while the DMA engine is running... That's
> calling for race conditions IMO. You should only have the DMA engine
> running while you are sending something or while you are expecting to
> receive something (U-Boot differs from an OS in that devices are not
> kept running unless actually needed).
>
>> What would be more expensive, flushing just the buffer area, or
>> flush_dcache_all()?
>
> You should call the API for flushing the area only. Obviously, you'll
> flush either just as much, or possibly more (thus take longer) if you
> flush the whole data cache.

Flushing an entire cache line by line is typically not as efficient as
using the special instructions to flush the entire cache(particularly
for large L2 caches). In our experiments on the 1MB L2$ cache on
Cortex-A9 the latter was more than 2x fast compared to the former.

best regards,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  8:58           ` David Jander
  2011-07-19  9:11             ` Albert ARIBAUD
@ 2011-07-19 11:42             ` Aneesh V
  1 sibling, 0 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-19 11:42 UTC (permalink / raw)
  To: u-boot

On Tuesday 19 July 2011 02:28 PM, David Jander wrote:
>
> Dear Aneesh,
>
> Thanks a lot for your replies.
>
> On Tue, 19 Jul 2011 14:13:34 +0530
> Aneesh V<aneesh@ti.com>  wrote:
>> On Tuesday 19 July 2011 02:07 PM, David Jander wrote:
>>> On Tue, 19 Jul 2011 10:21:12 +0200
>>> Albert ARIBAUD<albert.u.boot@aribaud.net>   wrote:
>>>
>>>> Hi David,
>>>>
>>>> Le 19/07/2011 09:44, David Jander a ?crit :
>>>>>
>>>>> Hi Stefano,
>>>>>
>>>>> On Mon, 18 Jul 2011 18:55:05 +0200
>>>>> Stefano Babic<sbabic@denx.de>    wrote:
>>>>>
>>>>>> On 07/18/2011 05:18 PM, David Jander wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>>> What is going on here? Why did this work with caches enabled before??
>>>>>>
>>>>>> I think cache was always disabled..
>>>>>
>>>>> I had even L2-caches enabled in u-boot (copied/adapted some code from
>>>>> OMAP cache.S), and called i/dcache_enable() from board code like this:
>>>>>
>>>>> int board_late_init(void)
>>>>> {
>>>>>            power_init();
>>>>>            probe_board_type();
>>>>>            icache_enable();
>>>>>            dcache_enable();
>>>>>
>>>>>            return 0;
>>>>> }
>>>>>
>>>>> Is there a reason this wouldn't have worked before?
>>>>>
>>>>> Suppose it didn't. Does that mean we need to use the MMU to properly mark
>>>>> regions of register space and specially FEC BD's as not-cached? Or do we
>>>>> need to flash caches manually each time such a memory region is accessed?
>>>>> I am kind of a CPU-speed-junkie, so I am not sure I want to live without
>>>>> caches enabled in u-boot ;-)
>>
>> Are you talking about some memory-mapped IO region(register space). If
>> that is the case, that region won't be cached. ARM mmu implementation
>> makes only the SDRAM region cached. Rest is non-cached non-buffered.
>
> Ah, thanks for pointing out. I already suspected something like that...
>
>>>> You would have to flush (before sending packets / starting external
>>>> memory-to-device DMA) and invalidate (before reading received packets /
>>>> after external device-to-memory DMA is done); using MMU and mapping
>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>>>> the xmit/receive buffers and descriptors.
>>>
>>> So, you say actually what I did while exploring the problem would have
>>> been a correct way of solving this problem?
>>>
>>> Like this:
>>>
>>> 587         flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>>
>> This is what is needed assuming the below is initiating a memory to
>> peripheral DMA. Is your buffer only 4 bytes long?
>
> No it isn't. I know, I should flush the whole buffer area, but this was just
> enough to get the status field flushed, so the FEC started transmitting, and
> the while loop ended eventually. The result was still not correct, but at
> least it won't hang.
> What would be more expensive, flushing just the buffer area, or
> flush_dcache_all()?

That depends on the buffer size. If the buffer size is really big
(comparable to L2$ size) flush_dcache_all() may turn out to be better.

>
>> Also, please check if flush_cache() is correctly supported for your
>> CPU. The default implementation in in arch/arm/lib/cache.c has support
>> for only a handful of cpus. AFAIK, only armv7 is over-riding this
>> default implementation at the moment.
>
> There is cache_v7.c which implements these... they are supposed to work
> correctly I guess?

Oops! I didn't notice that your CPU is armv7.
Yes, it's supposed to work correctly.

best regards,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  8:43         ` Aneesh V
  2011-07-19  8:58           ` David Jander
@ 2011-07-19  9:05           ` Albert ARIBAUD
  2011-07-19 14:36             ` J. William Campbell
  1 sibling, 1 reply; 36+ messages in thread
From: Albert ARIBAUD @ 2011-07-19  9:05 UTC (permalink / raw)
  To: u-boot

Le 19/07/2011 10:43, Aneesh V a ?crit :

>>> You would have to flush (before sending packets / starting external
>>> memory-to-device DMA) and invalidate (before reading received packets /
>>> after external device-to-memory DMA is done); using MMU and mapping
>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>>> the xmit/receive buffers and descriptors.
>>
>> So, you say actually what I did while exploring the problem would have
>> been a
>> correct way of solving this problem?
>>
>> Like this:
>>
>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>
> This is what is needed assuming the below is initiating a memory to
> peripheral DMA. Is your buffer only 4 bytes long?

Generally:

- for sending data through a device that has its own, external, DMA 
engine, you'll obviously need to flush the data buffer(s) but also any 
DMA descriptors used by the engine, before you start the engine;

- for rceiving, if you have to set up receive descriptors, you must 
flush that before telling the device to enter receive mode (so that the 
device reads the descriptors as you wrote them), and you should 
invalidate the receive buffers at the latest when the device signals 
that data has been received, or preferably long before (at the same time 
you flushed the read descriptor, so that cache-related actions are 
grouped in the same place in the code).

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19  9:05           ` Albert ARIBAUD
@ 2011-07-19 14:36             ` J. William Campbell
  2011-07-19 15:17               ` David Jander
  2011-07-19 18:14               ` Anton Staaf
  0 siblings, 2 replies; 36+ messages in thread
From: J. William Campbell @ 2011-07-19 14:36 UTC (permalink / raw)
  To: u-boot

On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
> Le 19/07/2011 10:43, Aneesh V a ?crit :
>
>>>> You would have to flush (before sending packets / starting external
>>>> memory-to-device DMA) and invalidate (before reading received packets /
>>>> after external device-to-memory DMA is done); using MMU and mapping
>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>>>> the xmit/receive buffers and descriptors.
>>> So, you say actually what I did while exploring the problem would have
>>> been a
>>> correct way of solving this problem?
>>>
>>> Like this:
>>>
>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>> This is what is needed assuming the below is initiating a memory to
>> peripheral DMA. Is your buffer only 4 bytes long?
> Generally:
>
> - for sending data through a device that has its own, external, DMA
> engine, you'll obviously need to flush the data buffer(s) but also any
> DMA descriptors used by the engine, before you start the engine;
>
> - for rceiving, if you have to set up receive descriptors, you must
> flush that before telling the device to enter receive mode (so that the
> device reads the descriptors as you wrote them), and you should
> invalidate the receive buffers at the latest when the device signals
> that data has been received,
Hi All,

>   or preferably long before (at the same time
> you flushed the read descriptor, so that cache-related actions are
> grouped in the same place in the code).
I think this last statement is incorrect. You should invalidate the 
cache for the receive buffers just before you intend to reference them. 
If you do it right after receive mode is entered, subsequent access to 
items NEAR the receive buffer may reload the cache with receive buffer 
data before the dma is done, re-creating the problem you are trying to 
avoid. Also, I don't know if ARM cache is write-back or write -thru, but 
if it is write-back, the only way to avoid problems is to allocate the 
receive buffers on cache line boundaries, so no "nearby" writes can 
cause something in the DMA buffer to be corrupted. If all receive 
buffers are allocated on cache-line boundaries (both start and end of 
each buffer), you can invalidate the cache "early" under the assumption 
that there will be no read accesses to the read buffers themselves until 
after DMA is complete. IMHO it is better, even in this case., to 
invalidate cache after dma is done but before referencing the read data.

Best Regards,
Bill Campbell

>
> Amicalement,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 14:36             ` J. William Campbell
@ 2011-07-19 15:17               ` David Jander
  2011-07-19 18:14               ` Anton Staaf
  1 sibling, 0 replies; 36+ messages in thread
From: David Jander @ 2011-07-19 15:17 UTC (permalink / raw)
  To: u-boot

On Tue, 19 Jul 2011 07:36:32 -0700
"J. William Campbell" <jwilliamcampbell@comcast.net> wrote:

> On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
> > Le 19/07/2011 10:43, Aneesh V a ?crit :
> >
> >>>> You would have to flush (before sending packets / starting external
> >>>> memory-to-device DMA) and invalidate (before reading received packets /
> >>>> after external device-to-memory DMA is done); using MMU and mapping
> >>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
> >>>> the xmit/receive buffers and descriptors.
> >>> So, you say actually what I did while exploring the problem would have
> >>> been a
> >>> correct way of solving this problem?
> >>>
> >>> Like this:
> >>>
> >>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
> >> This is what is needed assuming the below is initiating a memory to
> >> peripheral DMA. Is your buffer only 4 bytes long?
> > Generally:
> >
> > - for sending data through a device that has its own, external, DMA
> > engine, you'll obviously need to flush the data buffer(s) but also any
> > DMA descriptors used by the engine, before you start the engine;
> >
> > - for rceiving, if you have to set up receive descriptors, you must
> > flush that before telling the device to enter receive mode (so that the
> > device reads the descriptors as you wrote them), and you should
> > invalidate the receive buffers at the latest when the device signals
> > that data has been received,
> Hi All,
> 
> >   or preferably long before (at the same time
> > you flushed the read descriptor, so that cache-related actions are
> > grouped in the same place in the code).
> I think this last statement is incorrect. You should invalidate the 
> cache for the receive buffers just before you intend to reference them. 
> If you do it right after receive mode is entered, subsequent access to 
> items NEAR the receive buffer may reload the cache with receive buffer 
> data before the dma is done, re-creating the problem you are trying to 
> avoid. Also, I don't know if ARM cache is write-back or write -thru, but 
> if it is write-back, the only way to avoid problems is to allocate the 
> receive buffers on cache line boundaries, so no "nearby" writes can 
> cause something in the DMA buffer to be corrupted. If all receive 
> buffers are allocated on cache-line boundaries (both start and end of 
> each buffer), you can invalidate the cache "early" under the assumption 
> that there will be no read accesses to the read buffers themselves until 
> after DMA is complete. IMHO it is better, even in this case., to 
> invalidate cache after dma is done but before referencing the read data.

Hey, thanks a lot for this bit of information! I really needed it! :-)
Indeed I hadn't thought about the situation of nearby memory accesses
allocating critical cache-lines...

Coming to think of it... would it really be that hard to enable the MMU in
u-boot and configure just two pools of flat, direct mapped linear memory to
malloc and dma_malloc from?
That would make fixing broken drivers trivial, and have other benefits (like
video memory coherence when caches are on)!
IMHO, changing the necessary malloc()'s for dma_malloc()'s is a lot less work
and easier to oversee than obscure cache-manipulation and alignment tricks all
over the place.... What do you think?

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 14:36             ` J. William Campbell
  2011-07-19 15:17               ` David Jander
@ 2011-07-19 18:14               ` Anton Staaf
  2011-07-19 20:11                 ` J. William Campbell
  2011-07-20  8:37                 ` Aneesh V
  1 sibling, 2 replies; 36+ messages in thread
From: Anton Staaf @ 2011-07-19 18:14 UTC (permalink / raw)
  To: u-boot

On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell
<jwilliamcampbell@comcast.net> wrote:
> On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
>> Le 19/07/2011 10:43, Aneesh V a ?crit :
>>
>>>>> You would have to flush (before sending packets / starting external
>>>>> memory-to-device DMA) and invalidate (before reading received packets /
>>>>> after external device-to-memory DMA is done); using MMU and mapping
>>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>>>>> the xmit/receive buffers and descriptors.
>>>> So, you say actually what I did while exploring the problem would have
>>>> been a
>>>> correct way of solving this problem?
>>>>
>>>> Like this:
>>>>
>>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>>> This is what is needed assuming the below is initiating a memory to
>>> peripheral DMA. Is your buffer only 4 bytes long?
>> Generally:
>>
>> - for sending data through a device that has its own, external, DMA
>> engine, you'll obviously need to flush the data buffer(s) but also any
>> DMA descriptors used by the engine, before you start the engine;
>>
>> - for rceiving, if you have to set up receive descriptors, you must
>> flush that before telling the device to enter receive mode (so that the
>> device reads the descriptors as you wrote them), and you should
>> invalidate the receive buffers at the latest when the device signals
>> that data has been received,
> Hi All,
>
>> ? or preferably long before (at the same time
>> you flushed the read descriptor, so that cache-related actions are
>> grouped in the same place in the code).
> I think this last statement is incorrect. You should invalidate the
> cache for the receive buffers just before you intend to reference them.
> If you do it right after receive mode is entered, subsequent access to
> items NEAR the receive buffer may reload the cache with receive buffer
> data before the dma is done, re-creating the problem you are trying to
> avoid. Also, I don't know if ARM cache is write-back or write -thru, but
> if it is write-back, the only way to avoid problems is to allocate the
> receive buffers on cache line boundaries, so no "nearby" writes can
> cause something in the DMA buffer to be corrupted. If all receive
> buffers are allocated on cache-line boundaries (both start and end of
> each buffer), you can invalidate the cache "early" under the assumption
> that there will be no read accesses to the read buffers themselves until
> after DMA is complete. IMHO it is better, even in this case., to
> invalidate cache after dma is done but before referencing the read data.

This is a critical observation, and one I was going to make if I had
made it to the end of the thread and no one had already pointed it
out.  In fact, there is no way with the current implementation (that I
can see) of the v7_dcache_inval_range function to correctly implement
a cache enabled DMA driven driver without aligning buffers to cache
line sizes.  Below is the commit message from a recent patch I made
locally to fix the Tegra MMC driver.  I wanted to start a discussion
on the list about forcing all buffers to be aligned to cache line
sizes.  The problem is that many buffers are stack allocated or part
of structs that are of unknown alignment.

    mmc: Tegra2: Enable dcache support by bouncing unaligned requests.

    Dcache support was disabled due to dcache/dma interactions with
    unaligned buffers.  When an unaligned buffer is used for DMA the
    first and last few bytes of the buffer will be clobbered by the
    dcache invalidate call that is required to make the contents of
    the buffer visible to the CPU post DMA.  The reason that these
    bytes are clobbered is that the dcache invalidate code (which is
    the same as the linux kernels) checks for unaligned invalidates
    and first flushes the cache lines that are being partially
    invalidated.  This is required because the cache lines may be
    shared with other variables, and may be dirty.  This flush however
    writes over the first and last few bytes of the unaligned buffer
    with whatever happened to be in the cache.

    There are a number of possible solutions:

    1) Modify the invalidate code to first read the partial cache line
    and then invalidate and then write back just the valid part of the
    line.  This suffers from a race condition with concurrent code in
    interrupt handlers or other CPUs.

    2) Modify all of U-Boot to allocate buffers from a block allocation
    mechanism that ensures they are alligned on cache line boundaries.
    While this is possible, there are some cases where the public API
    passes down a buffer pointer all the way to the block read interface.
    So all stand alone U-Boot programs would have to be fixed as well.

    3) Use a bounce buffer that is known to be alligned.  Allocate the
    buffer in the Tegra MMC code and ensure it is large enough for the
    read/write as well as aligned correctly.  This solution requires an
    extra memcpy of the data read or written and has a high water mark
    on memory consumption.  It can be conditionally used only when the
    buffer passed in is unaligned.

    This patch implements the third solution.

-Anton

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 18:14               ` Anton Staaf
@ 2011-07-19 20:11                 ` J. William Campbell
  2011-07-20 13:02                   ` Albert ARIBAUD
  2011-07-20  8:37                 ` Aneesh V
  1 sibling, 1 reply; 36+ messages in thread
From: J. William Campbell @ 2011-07-19 20:11 UTC (permalink / raw)
  To: u-boot

On 7/19/2011 11:14 AM, Anton Staaf wrote:
> On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell
> <jwilliamcampbell@comcast.net>  wrote:
>> On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
>>> Le 19/07/2011 10:43, Aneesh V a ?crit :
>>>
>>>>>> You would have to flush (before sending packets / starting external
>>>>>> memory-to-device DMA) and invalidate (before reading received packets /
>>>>>> after external device-to-memory DMA is done); using MMU and mapping
>>>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>>>>>> the xmit/receive buffers and descriptors.
>>>>> So, you say actually what I did while exploring the problem would have
>>>>> been a
>>>>> correct way of solving this problem?
>>>>>
>>>>> Like this:
>>>>>
>>>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>>>> This is what is needed assuming the below is initiating a memory to
>>>> peripheral DMA. Is your buffer only 4 bytes long?
>>> Generally:
>>>
>>> - for sending data through a device that has its own, external, DMA
>>> engine, you'll obviously need to flush the data buffer(s) but also any
>>> DMA descriptors used by the engine, before you start the engine;
>>>
>>> - for rceiving, if you have to set up receive descriptors, you must
>>> flush that before telling the device to enter receive mode (so that the
>>> device reads the descriptors as you wrote them), and you should
>>> invalidate the receive buffers at the latest when the device signals
>>> that data has been received,
>> Hi All,
>>
>>>    or preferably long before (at the same time
>>> you flushed the read descriptor, so that cache-related actions are
>>> grouped in the same place in the code).
>> I think this last statement is incorrect. You should invalidate the
>> cache for the receive buffers just before you intend to reference them.
>> If you do it right after receive mode is entered, subsequent access to
>> items NEAR the receive buffer may reload the cache with receive buffer
>> data before the dma is done, re-creating the problem you are trying to
>> avoid. Also, I don't know if ARM cache is write-back or write -thru, but
>> if it is write-back, the only way to avoid problems is to allocate the
>> receive buffers on cache line boundaries, so no "nearby" writes can
>> cause something in the DMA buffer to be corrupted. If all receive
>> buffers are allocated on cache-line boundaries (both start and end of
>> each buffer), you can invalidate the cache "early" under the assumption
>> that there will be no read accesses to the read buffers themselves until
>> after DMA is complete. IMHO it is better, even in this case., to
>> invalidate cache after dma is done but before referencing the read data.
> This is a critical observation, and one I was going to make if I had
> made it to the end of the thread and no one had already pointed it
> out.  In fact, there is no way with the current implementation (that I
> can see) of the v7_dcache_inval_range function to correctly implement
> a cache enabled DMA driven driver without aligning buffers to cache
> line sizes.  Below is the commit message from a recent patch I made
> locally to fix the Tegra MMC driver.  I wanted to start a discussion
> on the list about forcing all buffers to be aligned to cache line
> sizes.  The problem is that many buffers are stack allocated or part
> of structs that are of unknown alignment.
>
>      mmc: Tegra2: Enable dcache support by bouncing unaligned requests.
>
>      Dcache support was disabled due to dcache/dma interactions with
>      unaligned buffers.  When an unaligned buffer is used for DMA the
>      first and last few bytes of the buffer will be clobbered by the
>      dcache invalidate call that is required to make the contents of
>      the buffer visible to the CPU post DMA.  The reason that these
>      bytes are clobbered is that the dcache invalidate code (which is
>      the same as the linux kernels) checks for unaligned invalidates
>      and first flushes the cache lines that are being partially
>      invalidated.  This is required because the cache lines may be
>      shared with other variables, and may be dirty.
Hi Anton,
>    This flush however
>      writes over the first and last few bytes of the unaligned buffer
>      with whatever happened to be in the cache.
If this is true, then it means that the cache is of type write-back (as 
opposed to write-thru). From a (very brief) look at the arm7 manuals, it 
appears that both types of cache may be present in the cpu. Do you know 
how this operates? If all that we have is write-back, you are correct. 
There is no safe way to do DMA to non-cache aligned buffers. This means 
both ends of the buffer must be cache aligned, so you may have to 
allocate it bigger than the transfer requires. For unfamiliar readers, 
write-back cache has only a single dirty bit for each cache line. When a 
dirty line is flushed, the cpu doesn't know which data changed, so it 
writes all locations back into memory, thus write-back. Write-thru cache 
updates cache on a write, and also schedules a write to memory 
immediately of the new data.  That way, no writes are actually necessary 
to flush the cache. The downside is, many updates of the same memory 
location will produce a lot of main-memory writes that may not actually 
be required.

Best Regards,
Bill Campbell
>
>      There are a number of possible solutions:
>
>      1) Modify the invalidate code to first read the partial cache line
>      and then invalidate and then write back just the valid part of the
>      line.  This suffers from a race condition with concurrent code in
>      interrupt handlers or other CPUs.
>
>      2) Modify all of U-Boot to allocate buffers from a block allocation
>      mechanism that ensures they are alligned on cache line boundaries.
>      While this is possible, there are some cases where the public API
>      passes down a buffer pointer all the way to the block read interface.
>      So all stand alone U-Boot programs would have to be fixed as well.
>
>      3) Use a bounce buffer that is known to be alligned.  Allocate the
>      buffer in the Tegra MMC code and ensure it is large enough for the
>      read/write as well as aligned correctly.  This solution requires an
>      extra memcpy of the data read or written and has a high water mark
>      on memory consumption.  It can be conditionally used only when the
>      buffer passed in is unaligned.
>
>      This patch implements the third solution.
>
> -Anton
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 20:11                 ` J. William Campbell
@ 2011-07-20 13:02                   ` Albert ARIBAUD
       [not found]                     ` <4E26DF9D.5070709@comcast.net>
  0 siblings, 1 reply; 36+ messages in thread
From: Albert ARIBAUD @ 2011-07-20 13:02 UTC (permalink / raw)
  To: u-boot

Le 19/07/2011 22:11, J. William Campbell a ?crit :

> If this is true, then it means that the cache is of type write-back (as
> opposed to write-thru). From a (very brief) look at the arm7 manuals, it
> appears that both types of cache may be present in the cpu. Do you know
> how this operates?

Usually, copyback (rather than writeback) and writethough are modes of 
operation, not cache types.

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 36+ messages in thread

[parent not found: <4E26DF9D.5070709@comcast.net>]

[parent not found: <4E26E7AA.9070001@aribaud.net>]

* [U-Boot] i.MX51: FEC: Cache coherency problem?
       [not found]                       ` <4E26E7AA.9070001@aribaud.net>
@ 2011-07-20 15:36                         ` J. William Campbell
  2011-07-21  6:48                           ` David Jander
  0 siblings, 1 reply; 36+ messages in thread
From: J. William Campbell @ 2011-07-20 15:36 UTC (permalink / raw)
  To: u-boot

On 7/20/2011 7:35 AM, Albert ARIBAUD wrote:
> Le 20/07/2011 16:01, J. William Campbell a ?crit :
>> On 7/20/2011 6:02 AM, Albert ARIBAUD wrote:
>>> Le 19/07/2011 22:11, J. William Campbell a ?crit :
>>>
>>>> If this is true, then it means that the cache is of type write-back 
>>>> (as
>>>> opposed to write-thru). From a (very brief) look at the arm7 
>>>> manuals, it
>>>> appears that both types of cache may be present in the cpu. Do you 
>>>> know
>>>> how this operates?
>>> Usually, copyback (rather than writeback) and writethough are modes of
>>> operation, not cache types.
>> Hi Albert,
>> One some CPUs both cache modes are available. On many other CPUs (I
>> would guess most), you have one fixed mode available, but not both. I
>> have always seen the two modes described as write-back and
>> write-through, but I am sure we are talking about the same things.
>
> We are. Copy-back is another name for write-back, not used by ARM but 
> by some others.
>
>> The
>> examples that have both modes that I am familiar with have the mode as a
>> "global" setting. It is not controlled by bits in the TLB or anything
>> like that. How does it work on ARM? Is it fixed, globally, globally
>> controlled, or controlled by memory management?
>
> Well, it's a bit complicated, because it depends on the architecture 
> version *and* implementation -- ARM themselves do not mandate things, 
> and it is up to the SoC designer to specify what cache they want and 
> what mode it supports, both at L1 and L2, in their specific instance 
> of ARM cores. And yes, you can have memory areas that are write-back 
> and others that are write-through in the same system.
>
>> If it is controlled by memory management, it looks to me like lots of
>> problems could be avoided by operating with input type buffers set as
>> write-through. One probably isn't going to be writing to input buffers
>> much under program control anyway, so the performance loss should be
>> minimal. This gets rid of the alignment restrictions on these buffers
>> but not the invalidate/flush requirements.
>
> There's not much you can do about alignment issues except align to 
> cache line boundaries.
>
>> However, if memory management
>> is required to set the cache mode, it might be best to operate with the
>> buffers and descriptors un-cached. That gets rid of the flush/invalidate
>> requirement at the expense of slowing down copying from read buffers.
>
> That makes 'best' a subjective choice, doesn't it? :)
Hi All,
         Yes,it probably depends on the usage.
>
>> Probably a reasonable price to pay for the associated simplicity.
>
> Others would say that spending some time setting up alignments and 
> flushes and invalidates is a reasonable price to pay for increased 
> performance... That's an open debate where no solution is The Right 
> One(tm).
>
> For instance, consider the TFTP image reading. People would like the 
> image to end up in cached memory because we'll do some checksumming on 
> it before we give it control, and having it cached makes this step 
> quite faster; but we'll lose that if we put it in non-cached memory 
> because it comes through the Ethernet controller's DMA; and it would 
> be worse to receive packets in non-cached memory only to move their 
> contents into cached memory later on.
>
> I think properly aligning descriptors and buffers is enough to avoid 
> the mixed flush/invalidate line issue, and wisely putting instruction 
> barriers should be enough to get the added performance of cache 
> without too much of the hassle of memory management.
I am pretty sure that all the drivers read the input data into 
intermediate buffers in all cases. There is no practical way to be sure 
the next packet received is the "right one" for the tftp. Plus there are 
headers involved, and furthermore there is no way to ensure that a tftp 
destination is located on a sector boundary. In short, you are going to 
copy from an input buffer to a destination.
However, it is still correct that copying from an non-cached area is 
slower than from cached areas, because of burst reads vs. individual 
reads. However, I doubt that the u-boot user can tell the difference, as 
the network latency will far exceed the difference in copy time. The 
question is, which is easier to do, and that is probably a matter of 
opinion. However, it is safe to say that so far a cached solution has 
eluded us. That may be changing, but it would still be nice to know how 
to allocate a section of un-cached RAM in the ARM processor, in so far 
as the question has a single answer! That would allow easy portability 
of drivers that do not know about caches, of which there seems to be many.

Best Regards,
Bill Campbell
>
>> Best Regards,
>> Bill Campbell
>
> Amicalement,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-20 15:36                         ` J. William Campbell
@ 2011-07-21  6:48                           ` David Jander
  2011-07-23 13:04                             ` Albert ARIBAUD
  0 siblings, 1 reply; 36+ messages in thread
From: David Jander @ 2011-07-21  6:48 UTC (permalink / raw)
  To: u-boot

On Wed, 20 Jul 2011 08:36:12 -0700
"J. William Campbell" <jwilliamcampbell@comcast.net> wrote:

> On 7/20/2011 7:35 AM, Albert ARIBAUD wrote:
> > Le 20/07/2011 16:01, J. William Campbell a ?crit :
> >> On 7/20/2011 6:02 AM, Albert ARIBAUD wrote:
> >>> Le 19/07/2011 22:11, J. William Campbell a ?crit :
> >>>
> >>>> If this is true, then it means that the cache is of type write-back 
> >>>> (as
> >>>> opposed to write-thru). From a (very brief) look at the arm7 
> >>>> manuals, it
> >>>> appears that both types of cache may be present in the cpu. Do you 
> >>>> know
> >>>> how this operates?
> >>> Usually, copyback (rather than writeback) and writethough are modes of
> >>> operation, not cache types.
> >> Hi Albert,
> >> One some CPUs both cache modes are available. On many other CPUs (I
> >> would guess most), you have one fixed mode available, but not both. I
> >> have always seen the two modes described as write-back and
> >> write-through, but I am sure we are talking about the same things.
> >
> > We are. Copy-back is another name for write-back, not used by ARM but 
> > by some others.
> >
> >> The
> >> examples that have both modes that I am familiar with have the mode as a
> >> "global" setting. It is not controlled by bits in the TLB or anything
> >> like that. How does it work on ARM? Is it fixed, globally, globally
> >> controlled, or controlled by memory management?
> >
> > Well, it's a bit complicated, because it depends on the architecture 
> > version *and* implementation -- ARM themselves do not mandate things, 
> > and it is up to the SoC designer to specify what cache they want and 
> > what mode it supports, both at L1 and L2, in their specific instance 
> > of ARM cores. And yes, you can have memory areas that are write-back 
> > and others that are write-through in the same system.
> >
> >> If it is controlled by memory management, it looks to me like lots of
> >> problems could be avoided by operating with input type buffers set as
> >> write-through. One probably isn't going to be writing to input buffers
> >> much under program control anyway, so the performance loss should be
> >> minimal. This gets rid of the alignment restrictions on these buffers
> >> but not the invalidate/flush requirements.
> >
> > There's not much you can do about alignment issues except align to 
> > cache line boundaries.
> >
> >> However, if memory management
> >> is required to set the cache mode, it might be best to operate with the
> >> buffers and descriptors un-cached. That gets rid of the flush/invalidate
> >> requirement at the expense of slowing down copying from read buffers.
> >
> > That makes 'best' a subjective choice, doesn't it? :)
> Hi All,
>          Yes,it probably depends on the usage.
> >
> >> Probably a reasonable price to pay for the associated simplicity.
> >
> > Others would say that spending some time setting up alignments and 
> > flushes and invalidates is a reasonable price to pay for increased 
> > performance... That's an open debate where no solution is The Right 
> > One(tm).
> >
> > For instance, consider the TFTP image reading. People would like the 
> > image to end up in cached memory because we'll do some checksumming on 
> > it before we give it control, and having it cached makes this step 
> > quite faster; but we'll lose that if we put it in non-cached memory 
> > because it comes through the Ethernet controller's DMA; and it would 
> > be worse to receive packets in non-cached memory only to move their 
> > contents into cached memory later on.
> >
> > I think properly aligning descriptors and buffers is enough to avoid 
> > the mixed flush/invalidate line issue, and wisely putting instruction 
> > barriers should be enough to get the added performance of cache 
> > without too much of the hassle of memory management.
> I am pretty sure that all the drivers read the input data into 
> intermediate buffers in all cases. There is no practical way to be sure 
> the next packet received is the "right one" for the tftp. Plus there are 
> headers involved, and furthermore there is no way to ensure that a tftp 
> destination is located on a sector boundary. In short, you are going to 
> copy from an input buffer to a destination.
> However, it is still correct that copying from an non-cached area is 
> slower than from cached areas, because of burst reads vs. individual 
> reads. However, I doubt that the u-boot user can tell the difference, as 
> the network latency will far exceed the difference in copy time. The 
> question is, which is easier to do, and that is probably a matter of 
> opinion. However, it is safe to say that so far a cached solution has 
> eluded us. That may be changing, but it would still be nice to know how 
> to allocate a section of un-cached RAM in the ARM processor, in so far 
> as the question has a single answer! That would allow easy portability 
> of drivers that do not know about caches, of which there seems to be many.

I agree. Unfortunately, my time is up for now, and I can't go on with trying
to fix this driver. Maybe I'll pick up after my vacation.
As for now I settled for the ugly solution of keeping dcache disabled while
ethernet is being used :-(
IMHO, doing cache maintenance all over the driver is not an easy or nice
solution. Implementing a non-cached memory pool in the MMU and a corresponding
dma_malloc() sounds like much more universally applicable to any driver.

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-21  6:48                           ` David Jander
@ 2011-07-23 13:04                             ` Albert ARIBAUD
  2011-07-23 15:35                               ` J. William Campbell
  0 siblings, 1 reply; 36+ messages in thread
From: Albert ARIBAUD @ 2011-07-23 13:04 UTC (permalink / raw)
  To: u-boot

Le 21/07/2011 08:48, David Jander a ?crit :

>> However, it is still correct that copying from an non-cached area is
>> slower than from cached areas, because of burst reads vs. individual
>> reads. However, I doubt that the u-boot user can tell the difference, as
>> the network latency will far exceed the difference in copy time.

That's assuming cache is only for networking. There can be DMA engines 
in a lot of other peripherals which do not have the same latency as 
network (and then, even for networking, TFTP can be done from a very 
nearby server, possibly even on the same Ethernet segment).

>> The
>> question is, which is easier to do, and that is probably a matter of
>> opinion. However, it is safe to say that so far a cached solution has
>> eluded us. That may be changing, but it would still be nice to know how
>> to allocate a section of un-cached RAM in the ARM processor, in so far
>> as the question has a single answer! That would allow easy portability
>> of drivers that do not know about caches, of which there seems to be many.

That is one approach, which I think prevents cache from being used 
beyond caching pure CPU-used DRAM.

> I agree. Unfortunately, my time is up for now, and I can't go on with trying
> to fix this driver. Maybe I'll pick up after my vacation.
> As for now I settled for the ugly solution of keeping dcache disabled while
> ethernet is being used :-(

Make sure you flush before disabling. :)

> IMHO, doing cache maintenance all over the driver is not an easy or nice
> solution. Implementing a non-cached memory pool in the MMU and a corresponding
> dma_malloc() sounds like much more universally applicable to any driver.

I think cache maintenance is feasible if one makes sure the cached areas 
used by the driver are properly aligned, which simplifies things a lot: 
you don't have to care for flush-invalidate or just-in-time invalidate, 
you just have to flush before sending and invalidate before reading.

> Best regards,

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-23 13:04                             ` Albert ARIBAUD
@ 2011-07-23 15:35                               ` J. William Campbell
  0 siblings, 0 replies; 36+ messages in thread
From: J. William Campbell @ 2011-07-23 15:35 UTC (permalink / raw)
  To: u-boot

On 7/23/2011 6:04 AM, Albert ARIBAUD wrote:
> Le 21/07/2011 08:48, David Jander a ?crit :
>
>>> However, it is still correct that copying from an non-cached area is
>>> slower than from cached areas, because of burst reads vs. individual
>>> reads. However, I doubt that the u-boot user can tell the 
>>> difference, as
>>> the network latency will far exceed the difference in copy time.
>
> That's assuming cache is only for networking. There can be DMA engines 
> in a lot of other peripherals which do not have the same latency as 
> network (and then, even for networking, TFTP can be done from a very 
> nearby server, possibly even on the same Ethernet segment).
Hi All,
         Yes, there are other uses of DMA. On a network, unless you have 
a Gigabit network, your memory access speed is at least an order of 
magnitude faster than the network, probably more. Plus, there is a 
latency due to sending the ack and request for the next record that 
undoubtedly swamps out any reduction in memory speed due to the single 
copy that takes place. In the case of other devices, like for example 
disks, the percentage effect is probably greater, but since these 
devices are so fast anyway that the human-perceived speed reduction is 
essentially nil. If we were talking about a CPU running Linux and doing 
all kinds of I/O all day long, the reduction in throughput  performance 
might be 10% and that might matter. In a boot loader that does I/O 
mostly to read in a program to replace itself, I would argue that nobody 
will notice the difference between cached and un-cached buffers. 
Counter-examples welcome however.
>
>>> The
>>> question is, which is easier to do, and that is probably a matter of
>>> opinion. However, it is safe to say that so far a cached solution has
>>> eluded us. That may be changing, but it would still be nice to know how
>>> to allocate a section of un-cached RAM in the ARM processor, in so far
>>> as the question has a single answer! That would allow easy portability
>>> of drivers that do not know about caches, of which there seems to be 
>>> many.
>
> That is one approach, which I think prevents cache from being used 
> beyond caching pure CPU-used DRAM.
You are certainly correct there. However, I think the pure CPU-used ram 
case is the one that matters most. Uncompressing and checksumming of 
input data are typical u-boot functions that take significant time. The 
performance increase due to cache hits in these cases is huge, and 
easily perceptible by the user.
>
>> I agree. Unfortunately, my time is up for now, and I can't go on with 
>> trying
>> to fix this driver. Maybe I'll pick up after my vacation.
>> As for now I settled for the ugly solution of keeping dcache disabled 
>> while
>> ethernet is being used :-(
>
> Make sure you flush before disabling. :)
>
>> IMHO, doing cache maintenance all over the driver is not an easy or nice
>> solution. Implementing a non-cached memory pool in the MMU and a 
>> corresponding
>> dma_malloc() sounds like much more universally applicable to any driver.
>
> I think cache maintenance is feasible if one makes sure the cached 
> areas used by the driver are properly aligned, which simplifies things 
> a lot: you don't have to care for flush-invalidate or just-in-time 
> invalidate, you just have to flush before sending and invalidate 
> before reading.
I do agree it can be done. However, most (I think?) of the CPUs to which 
u-boot have been ported have cache-coherent DMA. As a result, cache 
issues for  these CPUs are not addressed in the driver at all. Often 
this means that cache support is done after the fact by somebody other 
than the original author who may not totally understand the original 
driver. If DMA buffers were always allocated from cache-coherent memory, 
either because the memory is un-cached or because the CPU is DMA cache 
coherent, no changes would be necessary to get the driver working 
correctly.  If performance ever became an issue in the un-cached case, 
then more work would be required, but in most cases, I expect nobody 
will notice.

Best Regards,
Bill Campbell
>
>> Best regards,
>
> Amicalement,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [U-Boot] i.MX51: FEC: Cache coherency problem?
  2011-07-19 18:14               ` Anton Staaf
  2011-07-19 20:11                 ` J. William Campbell
@ 2011-07-20  8:37                 ` Aneesh V
  1 sibling, 0 replies; 36+ messages in thread
From: Aneesh V @ 2011-07-20  8:37 UTC (permalink / raw)
  To: u-boot

Hi Anton,

On Tuesday 19 July 2011 11:44 PM, Anton Staaf wrote:
[snip ..]

>      There are a number of possible solutions:
>
>      1) Modify the invalidate code to first read the partial cache line
>      and then invalidate and then write back just the valid part of the
>      line.  This suffers from a race condition with concurrent code in
>      interrupt handlers or other CPUs.

How do you propose to implement this?

Are you suggesting something like momentarily turning off D-cache, read
the memory content into registers, enable cache again, mix the memory
content in registers with the cache content and then flush, or
something like that? I am afraid this will be way too complex if at all
viable.

I don't see any other easy option. Am I missing something?

best regards,
Aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2011-07-23 15:35 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-18 15:18 [U-Boot] i.MX51: FEC: Cache coherency problem? David Jander
2011-07-18 16:16 ` Aneesh V
2011-07-19  7:26   ` David Jander
2011-07-19 11:07   ` Matthias Weißer
2011-07-19 11:17     ` David Jander
2011-07-19 11:20       ` Wolfgang Denk
2011-07-19 12:10         ` David Jander
2011-07-20  6:29           ` David Jander
2011-07-20  8:56             ` Albert ARIBAUD
2011-07-20  9:21               ` David Jander
2011-07-20 10:29                 ` Aneesh V
2011-07-20 11:31                   ` David Jander
2011-07-20 12:05                     ` Aneesh V
2011-07-19 11:19     ` Wolfgang Denk
2011-07-19 14:31       ` Matthias Weißer
2011-07-19 11:51     ` Aneesh V
2011-07-18 16:55 ` Stefano Babic
2011-07-19  7:44   ` David Jander
2011-07-19  8:21     ` Albert ARIBAUD
2011-07-19  8:37       ` David Jander
2011-07-19  8:43         ` Aneesh V
2011-07-19  8:58           ` David Jander
2011-07-19  9:11             ` Albert ARIBAUD
2011-07-19 11:50               ` Aneesh V
2011-07-19 11:42             ` Aneesh V
2011-07-19  9:05           ` Albert ARIBAUD
2011-07-19 14:36             ` J. William Campbell
2011-07-19 15:17               ` David Jander
2011-07-19 18:14               ` Anton Staaf
2011-07-19 20:11                 ` J. William Campbell
2011-07-20 13:02                   ` Albert ARIBAUD
     [not found]                     ` <4E26DF9D.5070709@comcast.net>
     [not found]                       ` <4E26E7AA.9070001@aribaud.net>
2011-07-20 15:36                         ` J. William Campbell
2011-07-21  6:48                           ` David Jander
2011-07-23 13:04                             ` Albert ARIBAUD
2011-07-23 15:35                               ` J. William Campbell
2011-07-20  8:37                 ` Aneesh V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox