linuxppc-dev.lists.ozlabs.org archive mirror
* How to map memory uncached on PPC.
@ 2005-08-19 15:17 Stephen Williams
  2005-08-20  1:06 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen Williams @ 2005-08-19 15:17 UTC (permalink / raw)
  To: linuxppc-dev

My setup is Linux PPC kernel 2.4.30 on an embedded PPC405GPr.
The board has some image processing devices including compressors.
I'm working with high image rates, so performance is an issue.

The drivers for the pci based compressor chips support readv
and use map_user_kiobuf and pci_map_single to map the output
buffers for the read. (The devices do scatter DMA.) This is
too slow, though. More time is spent mapping than compressing!

I did some measurements, and it seems that the vast majority of
the time is spent in pci_map_single, which calls only the
consistent_sync function, which for FROMDEVICE calls only
invalidate_dcache_range. So I'm convinced that invalidating
the cache for the output buffer (which is large, since incoming
images can be large) is taking most of the time. So
I want to eliminate it.
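
Roughly, the 2.4 PPC code path I'm describing is equivalent to the
following paraphrased sketch; this is the effective behavior on a
platform with no IOMMU, not the literal kernel source:

/* Paraphrase of the effective 2.4 PPC behavior, not verbatim source. */
dma_addr_t pci_map_single(struct pci_dev *hwdev, void *ptr,
                          size_t size, int direction)
{
	consistent_sync(ptr, size, direction);	/* dcache maintenance only */
	return virt_to_bus(ptr);		/* no address translation */
}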

And the way I want to do that is to have a heap of memory in
the user-mode process mapped uncached. The hope is that I can
pass that through the readv to the driver, which sets up the
DMA. Then I can skip the pci_map_single (and thus the
invalidate_dcache_range), saving lots of time.

Plan-B would be to have a driver allocate the heap of memory,
but I really need the mapping into user mode to be uncached,
as the processor does some final touch-up (header, etc.) before
sending it to the next device.
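
For Plan-B, I imagine the driver's mmap method would look something
like this minimal sketch, assuming the 2.4 remap_page_range interface
and the PPC _PAGE_NO_CACHE/_PAGE_GUARDED PTE bits (my_buf_phys and
my_buf_size are illustrative names for a driver-owned, physically
contiguous buffer):

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > my_buf_size)
		return -EINVAL;

	/* Mark the user mapping cache-inhibited and guarded. */
	pgprot_val(vma->vm_page_prot) |= _PAGE_NO_CACHE | _PAGE_GUARDED;

	/* 2.4-era signature: (user virtual, physical, size, protection). */
	if (remap_page_range(vma->vm_start, my_buf_phys, size,
			     vma->vm_page_prot))
		return -EAGAIN;
	return 0;
}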
-- 
Steve Williams                "The woods are lovely, dark and deep.
steve at icarus.com           But I have promises to keep,
http://www.icarus.com         and lines to code before I sleep,
http://www.picturel.com       And lines to code before I sleep."

* How to map memory uncached on PPC
@ 2005-08-19 16:18 Stephen Williams
  0 siblings, 0 replies; 6+ messages in thread
From: Stephen Williams @ 2005-08-19 16:18 UTC (permalink / raw)
  To: linuxppc-embedded

My setup is Linux PPC kernel 2.4.30 on an embedded PPC405GPr.
The board has some image processing devices including compressors.
I'm working with high image rates, so performance is an issue.

The drivers for the pci based compressor chips support readv
and use map_user_kiobuf and pci_map_single to map the output
buffers for the read. (The devices do scatter DMA.) This is
too slow, though. More time is spent mapping than compressing!

I did some measurements, and it seems that the vast majority of
the time is spent in pci_map_single, which calls only the
consistent_sync function, which for FROMDEVICE calls only
invalidate_dcache_range. So I'm convinced that invalidating
the cache for the output buffer (which is large, since incoming
images can be large) is taking most of the time. So
I want to eliminate it.

And the way I want to do that is to have a heap of memory in
the user-mode process mapped uncached. The hope is that I can
pass that through the readv to the driver, which sets up the
DMA. Then I can skip the pci_map_single (and thus the
invalidate_dcache_range), saving lots of time.

Plan-B would be to have a driver allocate the heap of memory,
but I really need the mapping into user mode to be uncached,
as the processor does some final touch-up (header, etc.) before
sending it to the next device.
-- 
Steve Williams                "The woods are lovely, dark and deep.
steve at icarus.com           But I have promises to keep,
http://www.icarus.com         and lines to code before I sleep,
http://www.picturel.com       And lines to code before I sleep."

* Re: How to map memory uncached on PPC.
  2005-08-19 15:17 Stephen Williams
@ 2005-08-20  1:06 ` Benjamin Herrenschmidt
  2005-08-20 15:59   ` Stephen Williams
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-20  1:06 UTC (permalink / raw)
  To: Stephen Williams; +Cc: linuxppc-dev

On Fri, 2005-08-19 at 08:17 -0700, Stephen Williams wrote:
> My setup is Linux PPC kernel 2.4.30 on an embedded PPC405GPr.
> The board has some image processing devices including compressors.
> I'm working with high image rates, so performance is an issue.
> 
> The drivers for the pci based compressor chips support readv
> and use map_user_kiobuf and pci_map_single to map the output
> buffers for the read. (The devices do scatter DMA.) This is
> too slow, though. More time is spent mapping than compressing!
> 
> I did some measurements, and it seems that the vast majority of
> the time is spent in pci_map_single, which calls only the
> consistent_sync function, which for FROMDEVICE calls only
> invalidate_dcache_range. So I'm convinced that invalidating
> the cache for the output buffer (which is large, since incoming
> images can be large) is taking most of the time. So
> I want to eliminate it.
> 
> And the way I want to do that is to have a heap of memory in
> the user-mode process mapped uncached. The hope is that I can
> pass that through the readv to the driver, which sets up the
> DMA. Then I can skip the pci_map_single (and thus the
> invalidate_dcache_range), saving lots of time.
> 
> Plan-B would be to have a driver allocate the heap of memory,
> but I really need the mapping into user mode to be uncached,
> as the processor does some final touch-up (header, etc.) before
> sending it to the next device.

A simple experiment you can do is limit the memory used by the kernel
(booting with mem=xxxx) and then mmap /dev/mem to map the remaining
memory as if it were an I/O device, uncached. With that, you get a
quick-hack solution that at least validates the performance benefit.
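
Something like this quick user-space sketch; the physical base address
is hypothetical and has to match whatever RAM you hid from the kernel
with mem= (here, the region above 64MB on a board booted with mem=64M):

#include <fcntl.h>
#include <sys/mman.h>

#define PHYS_BASE 0x04000000UL		/* example: first byte above mem=64M */
#define BUF_SIZE  (8 * 1024 * 1024)	/* example buffer size */

void *map_reserved_ram(void)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0)
		return MAP_FAILED;
	/* Mapping RAM the kernel doesn't manage; whether it ends up
	 * uncached depends on the arch's /dev/mem handling, so verify. */
	return mmap(0, BUF_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, PHYS_BASE);
}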

Ben.

* Re: How to map memory uncached on PPC.
  2005-08-20  1:06 ` Benjamin Herrenschmidt
@ 2005-08-20 15:59   ` Stephen Williams
  2005-08-20 18:08     ` John W. Linville
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen Williams @ 2005-08-20 15:59 UTC (permalink / raw)
  To: linuxppc-dev

Benjamin Herrenschmidt wrote:

> On Fri, 2005-08-19 at 08:17 -0700, Stephen Williams wrote:
>>I did some measurements, and it seems that the vast majority of
>>the time is spent in pci_map_single, which calls only the
>>consistent_sync function, which for FROMDEVICE calls only
>>invalidate_dcache_range. So I'm convinced that invalidating
>>the cache for the output buffer (which is large, since incoming
>>images can be large) is taking most of the time. So
>>I want to eliminate it.

> A simple experiment you can do is limit the memory used by the kernel
> (booting with mem=xxxx) and then mmap /dev/mem to map the remaining
> memory as if it were an I/O device, uncached. With that, you get a
> quick-hack solution that at least validates the performance benefit.

I did an even simpler experiment: I commented out the pci_map_single,
which on a PPC only has the effect of calling invalidate_dcache_range
and returning the virt_to_bus of the address. Obviously, the cache
is still enabled for the processor, and the image data may get
corrupted, but this was a performance test, not a solution.
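
Concretely, the experiment amounted to something like this (an
illustrative driver fragment; names follow the 2.4 PCI DMA API):

	/* Before: map for DMA, which invalidates the dcache over buf. */
	dma_addr = pci_map_single(pdev, buf, len, PCI_DMA_FROMDEVICE);

	/* Experiment: skip the invalidate and take the bus address
	 * directly. Not safe for real use; the cache is still live. */
	dma_addr = virt_to_bus(buf);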

Your "test" is a not implausible solution, although it has for me
some administrative problems.

-- 
Steve Williams                "The woods are lovely, dark and deep.
steve at icarus.com           But I have promises to keep,
http://www.icarus.com         and lines to code before I sleep,
http://www.picturel.com       And lines to code before I sleep."

* Re: How to map memory uncached on PPC.
  2005-08-20 15:59   ` Stephen Williams
@ 2005-08-20 18:08     ` John W. Linville
  2005-08-21 15:06       ` Stephen Williams
  0 siblings, 1 reply; 6+ messages in thread
From: John W. Linville @ 2005-08-20 18:08 UTC (permalink / raw)
  To: Stephen Williams; +Cc: linuxppc-dev

On Sat, Aug 20, 2005 at 08:59:42AM -0700, Stephen Williams wrote:
> Benjamin Herrenschmidt wrote:

> >A simple experiment you can do is limit the memory used by the kernel
> >(booting with mem=xxxx) and then mmap /dev/mem to map the remaining
> >memory as if it were an I/O device, uncached. With that, you get a
> >quick-hack solution that at least validates the performance benefit.
> 
> I did an even simpler experiment: I commented out the pci_map_single,
> which on a PPC only has the effect of calling invalidate_dcache_range
> and returning the virt_to_bus of the address. Obviously, the cache
> is still enabled for the processor, and the image data may get
> corrupted, but this was a performance test, not a solution.

If your purpose is to evaluate performance, doesn't having the cache
enabled limit the usefulness of your test?  For example, if your cache
uses a write-back policy, then your test will probably outperform the
actual uncached accesses.  YMMV, I suppose...

John
-- 
John W. Linville
linville@tuxdriver.com

* Re: How to map memory uncached on PPC.
  2005-08-20 18:08     ` John W. Linville
@ 2005-08-21 15:06       ` Stephen Williams
  0 siblings, 0 replies; 6+ messages in thread
From: Stephen Williams @ 2005-08-21 15:06 UTC (permalink / raw)
  To: John W. Linville; +Cc: linuxppc-dev

John W. Linville wrote:
> On Sat, Aug 20, 2005 at 08:59:42AM -0700, Stephen Williams wrote:

>>I did an even simpler experiment: I commented out the pci_map_single,
>>which on a PPC only has the effect of calling invalidate_dcache_range
>>and returning the virt_to_bus of the address. Obviously, the cache
>>is still enabled for the processor, and the image data may get
>>corrupted, but this was a performance test, not a solution.
> 
> 
> If your purpose is to evaluate performance, doesn't having the cache
> enabled limit the usefulness of your test?  For example, if your cache
> uses a write-back policy, then your test will probably outperform the
> actual uncached accesses.  YMMV, I suppose...

No, it shouldn't. The setup is a driver for a PCI device that masters
its output to the buffer in question. What I'm measuring is not the
time to write the buffer (which is done by the hardware and so
bypasses the cache anyhow) but the time it takes to *prepare the
buffer for the device*. The buffer is not written to or read from
by device or processor during this time, but it is mapped and DMA
pointers are collected. Also, the pci_map_single invalidates the cache
by walking through the entire buffer (many megabytes) with this loop
from invalidate_dcache_range:

1:	dcbi	0,r3			/* invalidate the dcache line at r3 */
	addi	r3,r3,L1_CACHE_LINE_SIZE	/* step to the next line */
	bdnz	1b			/* decrement CTR and loop over the buffer */
	sync				/* wait for dcbi's to get to ram */

This is what I believe is taking a long time.
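
To put a rough number on it: assuming the 405's 32-byte cache lines, a
hypothetical 8MB buffer means 8*1024*1024/32 = 262,144 trips around
that loop, plus the final sync, before the device writes a single byte.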

Since nothing is touching that buffer during my test, and I'm mostly
dropping that code, I believe my test is valid.

-- 
Steve Williams                "The woods are lovely, dark and deep.
steve at icarus.com           But I have promises to keep,
http://www.icarus.com         and lines to code before I sleep,
http://www.picturel.com       And lines to code before I sleep."
