* pgprot_writecombine() and PATs on x86
@ 2007-04-25 18:02 Roland Dreier
2007-04-25 18:19 ` Andi Kleen
0 siblings, 1 reply; 10+ messages in thread
From: Roland Dreier @ 2007-04-25 18:02 UTC (permalink / raw)
To: Eric W. Biederman, linux-kernel; +Cc: mst, jackm, ak
Hi Eric,
Where do your patches to add an implementation of
pgprot_writecombine() using PATs on x86 stand? The mlx4 driver I'm
planning on merging for 2.6.22 would really like writecombining, and
I'm interested in doing the work to finally get the PAT stuff merged
(probably for 2.6.23 I guess).
Just to give a little background on my motivation: the mlx4 hardware
allows a page in its PCI space to be mapped, where the driver can write
descriptors and payloads directly, instead of ringing a doorbell and
having the HW fetch the descriptor from system memory, for better latency.
The driver allows this page to be mapped to userspace and used
directly from latency sensitive stuff like MPI applications. I added
a hacked up version of the PAT stuff to map the page into userspace
with pgprot_writecombine(), and that lowered the measured MPI latency
by 450 nsecs, which doesn't seem like much until I tell you that the
latency went from ~1.85 usec to ~1.4 usec. So copying the descriptor
is a huge part of the total latency and users are definitely going to
want that 25% improvement.
Anyway as I said I want to get the PAT stuff upstream, so I'm posting
this to find out the latest state and avoid duplicating work.
Thanks,
Roland
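
For reference, the figures quoted above can be checked with a few lines of arithmetic. This is only a minimal sketch: the helper name `latency_reduction_pct` is made up for illustration, and the inputs are the approximate numbers from the message, not new measurements.

```c
#include <assert.h>

/*
 * Relative latency reduction, in percent, when going from before_ns
 * to after_ns. Hypothetical helper, for illustration only.
 */
static double latency_reduction_pct(double before_ns, double after_ns)
{
    assert(before_ns > 0.0);
    return 100.0 * (before_ns - after_ns) / before_ns;
}
```

Calling `latency_reduction_pct(1850.0, 1400.0)` gives a little over 24%, which is where the "25%" in the message comes from.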
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:02 pgprot_writecombine() and PATs on x86 Roland Dreier
@ 2007-04-25 18:19 ` Andi Kleen
2007-04-25 18:31 ` Eric W. Biederman
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Andi Kleen @ 2007-04-25 18:19 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eric W. Biederman, linux-kernel, mst, jackm
On Wednesday 25 April 2007 20:02:26 Roland Dreier wrote:
> Hi Eric,
>
> Where do your patches to add an implementation of
> pgprot_writecombine() using PATs on x86 stand?
It's on my todo list.
> The mlx4 driver I'm
> planning on merging for 2.6.22 would really like writecombining, and
> I'm interested in doing the work to finally get the PAT stuff merged
> (probably for 2.6.23 I guess).
>
> Just to give a little background on my motivation: the mlx4 hardware
> allows a page in its PCI space to be mapped, where the driver can write
> descriptors and payloads directly, instead of ringing a doorbell and
> having the HW fetch the descriptor from system memory, for better latency.
When it's PCI space you can likely just use MTRRs. PAT is mostly useful
for applications that do IO with random memory pages
-Andi
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:19 ` Andi Kleen
@ 2007-04-25 18:31 ` Eric W. Biederman
2007-04-25 18:35 ` Roland Dreier
2007-04-25 18:33 ` Roland Dreier
2007-04-25 18:41 ` Dave Jones
2 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2007-04-25 18:31 UTC (permalink / raw)
To: Andi Kleen; +Cc: Roland Dreier, linux-kernel, mst, jackm
Andi Kleen <ak@suse.de> writes:
> On Wednesday 25 April 2007 20:02:26 Roland Dreier wrote:
>> Hi Eric,
>>
>> Where do your patches to add an implementation of
>> pgprot_writecombine() using PATs on x86 stand?
>
> It's on my todo list.
Basically enabling PAT is easy. Adding the paranoid checks is
trickier. I keep intending to do something but...
>> The mlx4 driver I'm
>> planning on merging for 2.6.22 would really like writecombining, and
>> I'm interested in doing the work to finally get the PAT stuff merged
>> (probably for 2.6.23 I guess).
>>
>> Just to give a little background on my motivation: the mlx4 hardware
>> allows a page in its PCI space to be mapped, where the driver can write
>> descriptors and payloads directly, instead of ringing a doorbell and
>> having the HW fetch the descriptor from system memory, for better latency.
>
> When it's PCI space you can likely just use MTRRs. PAT is mostly useful
> for applications that do IO with random memory pages
The problem is that on machines with larger memory configurations (8-12G)
there are no spare mtrrs, or the mtrrs can frequently be configured in
an overlapping way so that we can't set them up. In general mtrrs
work ok for one card, possibly for two, and after that they are just
useless.
PAT is also much easier to use from a driver perspective, and it is
much more portable between architectures. Using mtrrs from drivers
is almost impossible.
Roland, is the mlx4 sane enough to put the memory that needs
write-combining in a prefetchable BAR, so that several cards can be
combined together?
Eric
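
One way to see why larger memory configurations eat up MTRRs: each variable-range MTRR describes a power-of-two sized, naturally aligned range, so covering RAM that is not itself a power of two takes several registers. The sketch below counts the registers needed under simplifying assumptions: memory starts at address 0, only additive write-back ranges are used, and no uncachable hole is carved out for PCI space. The function name `ranges_needed_from_zero` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of power-of-two sized, naturally aligned ranges needed to cover
 * [0, size_mb) exactly, using additive ranges only. For a range that
 * starts at 0 this is the number of set bits in size_mb: taking the bits
 * from highest to lowest, every block starts at a multiple of its size.
 * Assumption: no overlapping "subtractive" ranges are used.
 */
static unsigned int ranges_needed_from_zero(uint64_t size_mb)
{
    unsigned int count = 0;

    while (size_mb != 0) {
        size_mb &= size_mb - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

For 8960MB (8192MB + 512MB + 256MB), the write-back layout in the /proc/mtrr listing Roland posts elsewhere in this thread, this returns 3. Real BIOS layouts typically add uncachable ranges for the PCI hole on top of that.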
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:31 ` Eric W. Biederman
@ 2007-04-25 18:35 ` Roland Dreier
2007-04-25 18:45 ` Eric W. Biederman
0 siblings, 1 reply; 10+ messages in thread
From: Roland Dreier @ 2007-04-25 18:35 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Andi Kleen, linux-kernel, mst, jackm
> Roland, is the mlx4 sane enough to put the memory that needs
> write-combining in a prefetchable BAR, so that several cards can be
> combined together?
Yes, it is in a prefetchable BAR. It's the second half of the second
BAR in:
0d:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)
Subsystem: Mellanox Technologies Unknown device 634a
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at fc400000 (64-bit, non-prefetchable) [size=1M]
Memory at d8000000 (64-bit, prefetchable) [size=8M]
Memory at fc3fe000 (64-bit, non-prefetchable) [size=8K]
Capabilities: <access denied>
but I'm not sure what you mean about combining several cards?
- R.
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:35 ` Roland Dreier
@ 2007-04-25 18:45 ` Eric W. Biederman
2007-04-26 5:43 ` Michael S. Tsirkin
0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2007-04-25 18:45 UTC (permalink / raw)
To: Roland Dreier; +Cc: Andi Kleen, linux-kernel, mst, jackm
Roland Dreier <rdreier@cisco.com> writes:
> > Roland, is the mlx4 sane enough to put the memory that needs
> > write-combining in a prefetchable BAR, so that several cards can be
> > combined together?
>
> Yes, it is in a prefetchable BAR. It's the second half of the second
> BAR in:
>
> 0d:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)
> Subsystem: Mellanox Technologies Unknown device 634a
> Flags: bus master, fast devsel, latency 0, IRQ 16
> Memory at fc400000 (64-bit, non-prefetchable) [size=1M]
> Memory at d8000000 (64-bit, prefetchable) [size=8M]
> Memory at fc3fe000 (64-bit, non-prefetchable) [size=8K]
> Capabilities: <access denied>
>
> but I'm not sure what you mean about combining several cards?
So in general the pci prefetchable attribute means write-combining as
well as prefetching is safe. A sane BIOS will allocate prefetchable
BARS contiguously in the address space. So on a good day you
can just use one MTRR to map all of the prefetchable BARs as write-combining.
Eric
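
The "one MTRR for all prefetchable BARs" idea can be sketched as a small calculation: find the smallest power-of-two sized, naturally aligned range that covers the span from the first prefetchable BAR to the end of the last one. The code is a rough illustration, not kernel code. `struct range` and `covering_range` are hypothetical names, the 4K minimum granularity and the 52-bit address limit are assumptions, and whether everything else inside the resulting range is safe to write-combine is exactly the "good day" caveat.

```c
#include <assert.h>
#include <stdint.h>

struct range {
    uint64_t base;
    uint64_t size;
};

/*
 * Smallest power-of-two sized, naturally aligned range that covers
 * [start, end). Assumes a 4K minimum granularity and physical addresses
 * below 2^52 (both are assumptions made for this sketch).
 */
static struct range covering_range(uint64_t start, uint64_t end)
{
    struct range r;
    uint64_t size = 4096;

    assert(start < end);
    assert(end <= (UINT64_C(1) << 52));

    for (;;) {
        uint64_t base = start & ~(size - 1); /* align start down */

        if (end - base <= size) {
            r.base = base;
            r.size = size;
            return r;
        }
        size <<= 1;
    }
}
```

For the single 8M BAR at 0xd8000000 from the lspci output this yields exactly that BAR. If a second, hypothetical 8M BAR sat directly after it, one 16M range would cover both. If the BARs are not placed so conveniently, the covering range grows past them and takes in address space that may not tolerate write-combining.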
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:45 ` Eric W. Biederman
@ 2007-04-26 5:43 ` Michael S. Tsirkin
2007-04-26 6:13 ` Eric W. Biederman
0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2007-04-26 5:43 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Roland Dreier, Andi Kleen, linux-kernel, Jack Morgenstein
> So in general the pci prefetchable attribute means write-combining as
> well as prefetching is safe. A sane BIOS will allocate prefetchable
> BARS contiguously in the address space. So on a good day you
> can just use one MTRR to map all of the prefetchable BARs as write-combining.
Good point, and sounds easy enough.
So why doesn't Linux do it automatically then, where possible?
There are sure to be some broken devices, but if some device
can't live with WC, we can always disable WC system-wide.
--
MST
* Re: pgprot_writecombine() and PATs on x86
2007-04-26 5:43 ` Michael S. Tsirkin
@ 2007-04-26 6:13 ` Eric W. Biederman
0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2007-04-26 6:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Roland Dreier, Andi Kleen, linux-kernel, Jack Morgenstein
"Michael S. Tsirkin" <mst@dev.mellanox.co.il> writes:
>> So in general the pci prefetchable attribute means write-combining as
>> well as prefetching is safe. A sane BIOS will allocate prefetchable
>> BARS contiguously in the address space. So on a good day you
>> can just use one MTRR to map all of the prefetchable BARs as write-combining.
>
> Good point, and sounds easy enough.
> So why doesn't Linux do it automatically then, where possible?
It does when we have support in the page tables. The MTRRs appear
too complex to use automatically. Getting all of the memory
set to write-back using them can be a chore. If things were truly
straightforward every BIOS would set up write-combining automatically.
> There are sure to be some broken devices, but if some device
> can't live with WC, we can always disable WC system-wide.
Yes.
Eric
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:19 ` Andi Kleen
2007-04-25 18:31 ` Eric W. Biederman
@ 2007-04-25 18:33 ` Roland Dreier
2007-04-25 18:41 ` Dave Jones
2 siblings, 0 replies; 10+ messages in thread
From: Roland Dreier @ 2007-04-25 18:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: Eric W. Biederman, linux-kernel, mst, jackm
> > Where do your patches to add an implementation of
> > pgprot_writecombine() using PATs on x86 stand?
>
> It's on my todo list.
Great. Let me know if there's anything I can do to help.
> When it's PCI space you can likely just use MTRRs. PAT is mostly useful
> for applications that do IO with random memory pages
Actually MTRRs seem to be inadequate for a number of reasons. For
example I have a system where /proc/mtrr looks like:
$ cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=8192MB: write-back, count=1
reg01: base=0x200000000 (8192MB), size= 512MB: write-back, count=1
reg02: base=0x220000000 (8704MB), size= 256MB: write-back, count=1
reg03: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1
reg04: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
And I want to map the second half of the second BAR of this device
with write-combining:
0d:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)
Subsystem: Mellanox Technologies Unknown device 634a
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at fc400000 (64-bit, non-prefetchable) [size=1M]
Memory at d8000000 (64-bit, prefetchable) [size=8M]
Memory at fc3fe000 (64-bit, non-prefetchable) [size=8K]
Capabilities: <access denied>
So it's not clear that there will be enough MTRRs to handle
everything, or that even if there are enough, that there's a safe way
to update the MTRRs to get from the boot-up config to the one we want.
In this case I guess there is a way but it uses all 8 MTRRs, so adding
a device that also wants write combining won't work.
And definitely trying to set up the MTRRs automatically is going to
be very fragile. So I think having pgprot_writecombine() implemented
with PATs is really the only sane thing even for this PCI space.
- R.
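
The conflict described above can be checked mechanically. The sketch below parses lines in the /proc/mtrr format quoted in the message and tests whether an uncachable entry overlaps the window that should become write-combining, here the second half of the 8M BAR at 0xd8000000, i.e. 4M starting at 0xd8400000. It is a userspace illustration only. `struct mtrr_entry`, `parse_mtrr_line` and `blocks_write_combining` are hypothetical names, and the parser assumes sizes are printed in MB as in the listing above. The check relies on the x86 rule that uncachable takes precedence when variable MTRR ranges overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct mtrr_entry {
    uint64_t base;  /* bytes */
    uint64_t size;  /* bytes */
    char type[32];  /* e.g. "write-back", "uncachable" */
};

/*
 * Parse one line such as
 *   reg03: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1
 * Returns 0 on success, -1 if the line does not match.
 * Assumption: the size is given in MB, as in the listing above.
 */
static int parse_mtrr_line(const char *line, struct mtrr_entry *e)
{
    unsigned int reg;
    unsigned long long base, base_mb, size_mb;

    assert(line != NULL && e != NULL);
    if (sscanf(line, "reg%u: base=0x%llx (%lluMB), size=%lluMB: %31[^,]",
               &reg, &base, &base_mb, &size_mb, e->type) != 5)
        return -1;

    e->base = (uint64_t)base;
    e->size = (uint64_t)size_mb << 20;
    return 0;
}

/* True if e is uncachable and overlaps [base, base + size). */
static int blocks_write_combining(const struct mtrr_entry *e,
                                  uint64_t base, uint64_t size)
{
    int overlap = e->base < base + size && base < e->base + e->size;

    return overlap && strcmp(e->type, "uncachable") == 0;
}
```

Fed the reg03 line from the listing and the window (0xd8400000, 4M), `blocks_write_combining` reports a conflict: the 256MB uncachable range at 0xd0000000 contains the whole BAR, so that range would have to be split up before any write-combining MTRR could take effect there.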
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:19 ` Andi Kleen
2007-04-25 18:31 ` Eric W. Biederman
2007-04-25 18:33 ` Roland Dreier
@ 2007-04-25 18:41 ` Dave Jones
2007-04-25 19:10 ` Andi Kleen
2 siblings, 1 reply; 10+ messages in thread
From: Dave Jones @ 2007-04-25 18:41 UTC (permalink / raw)
To: Andi Kleen; +Cc: Roland Dreier, Eric W. Biederman, linux-kernel, mst, jackm
On Wed, Apr 25, 2007 at 08:19:27PM +0200, Andi Kleen wrote:
> On Wednesday 25 April 2007 20:02:26 Roland Dreier wrote:
> > Hi Eric,
> >
> > Where do your patches to add an implementation of
> > pgprot_writecombine() using PATs on x86 stand?
>
> It's on my todo list.
What's the status of the code at ftp://ftp.firstfloor.org/pub/ak/pat/patches ?
I've not given it much more than a quick skim, so I'm not sure what
it's capable of, what it's missing, etc.
Dave
--
http://www.codemonkey.org.uk
* Re: pgprot_writecombine() and PATs on x86
2007-04-25 18:41 ` Dave Jones
@ 2007-04-25 19:10 ` Andi Kleen
0 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2007-04-25 19:10 UTC (permalink / raw)
To: Dave Jones; +Cc: Roland Dreier, Eric W. Biederman, linux-kernel, mst, jackm
On Wednesday 25 April 2007 20:41:08 Dave Jones wrote:
> On Wed, Apr 25, 2007 at 08:19:27PM +0200, Andi Kleen wrote:
> > On Wednesday 25 April 2007 20:02:26 Roland Dreier wrote:
> > > Hi Eric,
> > >
> > > Where do your patches to add an implementation of
> > > pgprot_writecombine() using PATs on x86 stand?
> >
> > It's on my todo list.
>
> What's the status of the code at ftp://ftp.firstfloor.org/pub/ak/pat/patches ?
> I've not given it much more than a quick skim, so I'm not sure what
> it's capable of, what it's missing, etc.
That code has a couple of problems and is relatively broken
-Andi