* Suggestions for how to remove bus_to_virt()
@ 2006-07-12 23:29 Ralph Campbell
2006-07-12 23:40 ` David Miller
2006-07-13 0:11 ` Roland Dreier
0 siblings, 2 replies; 12+ messages in thread
From: Ralph Campbell @ 2006-07-12 23:29 UTC (permalink / raw)
To: Roland Dreier; +Cc: openib-general, linux-kernel
I have been looking at how to eliminate the bus_to_virt() and
phys_to_virt() calls used by the ib_ipath driver.
I am looking for suggestions on how to proceed.
The current IB core to IB device driver interface relies
on a kernel module being able to call ib_get_dma_mr() to allocate
a memory region which represents all of device addressable memory.
The kernel module is then expected to call dma_map_single(),
dma_map_sg(), etc. to convert physical or virtual addresses into
device addresses. If the system has an IOMMU, there may be several
physical pages mapped to a single contiguous device address region.
This device address and length (possibly an array of them) is then
passed to the IB device driver so the IB device can DMA data
to or from memory.
The ib_ipath driver cannot tell the HW to DMA data directly to the
device (IOMMU) addresses and must copy the data. This means the driver
needs to reverse the IOMMU mapping and somehow obtain kernel virtual
addresses so it can memcpy() the data to the correct location.
Currently, the ib_ipath driver requires that the mapping be one-to-one
since there is no practical way to reverse IOMMU mappings.
I believe it is generally agreed that trying to change the dma_map_*
interface to include functions of this sort is not the right approach
to take.
One solution is to change the IB device driver interface so that
kernel virtual addresses are passed to the IB device driver and
the device driver is responsible for calling dma_map_single(), etc.
I believe this will be unacceptable to the OpenFabrics community
since it would require the driver to allocate large amounts of memory
(#QPs * #MaxWRs * sizeof(dma_addr_t + length)) to store the
information needed to undo the mapping when the DMA is complete.
The current IB code allocates the storage for dma_unmap_single(), etc.
as extra elements in structures already needed so it isn't a large
overhead and it is based on the actual number of requests posted
instead of the maximums allowed.
Another solution is to change the IB device driver interface to add
a function which tells the caller what type of addresses the device
expects. Kernel modules would then be required to pass either a
dma_map_xxx() address or a kernel virtual address based on the
driver's preference.
The current set of IB consumers either start with kmalloc/vmalloc
memory (such as the MAD layer) or a list of physical pages
(such as ISER and SRP). The current code could therefore be
fairly easily changed except for ISER/SRP when a struct page
doesn't have a direct kernel address (high pages) and would
need to call kmap()/kunmap() in that case.
I plan to implement this last approach unless someone has
a better idea. I would like to get some "buy-in" before
I spend a lot time coding only to be rejected when finished.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions for how to remove bus_to_virt()
2006-07-12 23:29 Suggestions for how to remove bus_to_virt() Ralph Campbell
@ 2006-07-12 23:40 ` David Miller
2006-07-13 0:11 ` Roland Dreier
1 sibling, 0 replies; 12+ messages in thread
From: David Miller @ 2006-07-12 23:40 UTC (permalink / raw)
To: ralphc; +Cc: rolandd, openib-general, linux-kernel
From: Ralph Campbell <ralphc@pathscale.com>
Date: Wed, 12 Jul 2006 16:29:27 -0700
> Currently, the ib_ipath driver requires that the mapping be
> one-to-one since there is no practical way to reverse IOMMU
> mappings.
You can maintain a hash table that maps DMA addresses back to kernel
mappings. Depending upon your situation, you can optimize this to use
very small keys if you have some kind of other identification method
for your buffers.
That would be for dynamic mappings.
You were using consistent DMA memory, which I gather you're not,
you could use the PCI DMA pool mechanism.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions for how to remove bus_to_virt()
2006-07-12 23:29 Suggestions for how to remove bus_to_virt() Ralph Campbell
2006-07-12 23:40 ` David Miller
@ 2006-07-13 0:11 ` Roland Dreier
2006-07-13 0:40 ` David Miller
1 sibling, 1 reply; 12+ messages in thread
From: Roland Dreier @ 2006-07-13 0:11 UTC (permalink / raw)
To: Ralph Campbell; +Cc: Roland Dreier, openib-general, linux-kernel
> One solution is to change the IB device driver interface so that
> kernel virtual addresses are passed to the IB device driver and
> the device driver is responsible for calling dma_map_single(), etc.
> I believe this will be unacceptable to the OpenFabrics community
Actually it's worse than unacceptable -- I don't see how this can work
at all. The problem is that addresses are not just passed directly to
the local HCA; they also might be embedded in protocol messages that
are sent to a remote HCA and then used by the remote HCA to initiate
RDMA.
For example, the SRP driver uses ib_get_dma_mr() to get an R_Key,
which it then sends to the target along with a DMA address. The
target uses that R_Key/address to RDMA data directly to or from the
host. There's no good way for the low-level driver to handle the DMA
mapping, since the address is embedded in a protocol message that the
low-level driver knows nothing about.
> Another solution is to change the IB device driver interface to add
> a function which tells the caller what type of addresses the device
> expects. Kernel modules would then be required to pass either a
> dma_map_xxx() address or a kernel virtual address based on the
> driver's preference.
> The current set of IB consumers either start with kmalloc/vmalloc
> memory (such as the MAD layer) or a list of physical pages
> (such as ISER and SRP). The current code could therefore be
> fairly easily changed except for ISER/SRP when a struct page
> doesn't have a direct kernel address (high pages) and would
> need to call kmap()/kunmap() in that case.
I have a few problems with this: first, it's unfortunate that every
consumer needs two code paths to handle the two possibilities; second,
I don't see how it handles the case of SRP's use of the
ib_get_dma_mr() R_Key as above anyway; third, expecting consumers to
kmap pages for a long time across work request execution is a bad
idea.
Maybe the least bad solution would be to add rdma_XXX wrappers around
the dma mapping functions that RDMA consumers use; then most low-level
drivers could just pass them through to the DMA mapping API, while the
ipath driver could handle things itself.
The problem with that is that it ends up wrapping a huge API -- for
example, you need both dma_map_single and dma_map_sg at least, plus
someone might want to use dma_alloc_coherent memory, not to mention
the dma_pool stuff, etc.
A cleaner solution would be to make the dma_ API really use the device
it's passed anyway, and allow drivers to override the standard PCI
stuff nicely. But that would be major surgery, I guess.
(BTW, vmalloc memory should not be used for DMA, since it's not
guaranteed to be DMA-able -- so anyone doing that is just wrong)
- R.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions for how to remove bus_to_virt()
2006-07-13 0:11 ` Roland Dreier
@ 2006-07-13 0:40 ` David Miller
2006-07-13 5:46 ` [openib-general] " Muli Ben-Yehuda
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: David Miller @ 2006-07-13 0:40 UTC (permalink / raw)
To: rdreier; +Cc: ralphc, rolandd, openib-general, linux-kernel
From: Roland Dreier <rdreier@cisco.com>
Date: Wed, 12 Jul 2006 17:11:26 -0700
> A cleaner solution would be to make the dma_ API really use the device
> it's passed anyway, and allow drivers to override the standard PCI
> stuff nicely. But that would be major surgery, I guess.
Clean but expensive, you should not force the rest of the kernel
to eat the cost of something you want to do when it's totally
unnecessary for most other users.
For example, x86 never needs to do anything other than a direct
virt_to_phys translation to produce a DMA address, no matter what
bus the device is on. It's a single simple integer adjustment
that can be done inline in about 2 or 3 instructions at most.
Once you start allowing overrides then even x86 starts to eat the
stupid costs of dereferencing some kind of device ops method.
That doesn't make any sense, and that's why the DMA API works the way
it does now. It's a platform or bus operation, not a device one.
If you need device level DMA mapping semantics, create them for your
device type. This is what USB does, btw.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [openib-general] Suggestions for how to remove bus_to_virt()
2006-07-13 0:40 ` David Miller
@ 2006-07-13 5:46 ` Muli Ben-Yehuda
2006-07-14 22:27 ` Ralph Campbell
2006-07-13 7:45 ` Stefan Richter
2006-07-13 16:02 ` Roland Dreier
2 siblings, 1 reply; 12+ messages in thread
From: Muli Ben-Yehuda @ 2006-07-13 5:46 UTC (permalink / raw)
To: David Miller; +Cc: rdreier, linux-kernel, openib-general
On Wed, Jul 12, 2006 at 05:40:13PM -0700, David Miller wrote:
> From: Roland Dreier <rdreier@cisco.com>
> Date: Wed, 12 Jul 2006 17:11:26 -0700
>
> > A cleaner solution would be to make the dma_ API really use the device
> > it's passed anyway, and allow drivers to override the standard PCI
> > stuff nicely. But that would be major surgery, I guess.
>
> Clean but expensive, you should not force the rest of the kernel
> to eat the cost of something you want to do when it's totally
> unnecessary for most other users.
>
> For example, x86 never needs to do anything other than a direct
> virt_to_phys translation to produce a DMA address, no matter what
> bus the device is on. It's a single simple integer adjustment
> that can be done inline in about 2 or 3 instructions at most.
It's possible that even x86 will support multiple IOMMUs in the future
- for example, the Calgary IOMMU support we recently added to x86-64
could be modified to work on plain x86 as well.
I like the idea of a per-device DMA-API implementation, but only if it
can be done in a way that is zero cost to the majority of the users of
the API. We already have dynamic dma_ops on x86-64 to support nommu,
swiotlb, gart and Calgary cleanly, extending it to use a per-device
dma-ops isn't too difficult.
Cheers,
Muli
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions for how to remove bus_to_virt()
2006-07-13 0:40 ` David Miller
2006-07-13 5:46 ` [openib-general] " Muli Ben-Yehuda
@ 2006-07-13 7:45 ` Stefan Richter
2006-07-13 16:02 ` Roland Dreier
2 siblings, 0 replies; 12+ messages in thread
From: Stefan Richter @ 2006-07-13 7:45 UTC (permalink / raw)
To: ralphc; +Cc: David Miller, rdreier, rolandd, openib-general, linux-kernel
David Miller wrote:
> If you need device level DMA mapping semantics, create them for your
> device type. This is what USB does, btw.
Ralph,
two other examples where drivers provide some sort of address lookup are:
- drivers/ieee1394/dma.[hc]
AFAIK this deals with housekeeping of ringbuffers as used by
1394 controllers for isochronous transmit and receive. Users of
this little API are dv1394, video1394, ohci1394.
- patch "dc395x: dynamically map scatter-gather for PIO" by
Guennadi Liakhovetski,
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cdb8c2a6d848deb9eeefffff42974478fbb51b8c
This mapping is not specific to SCSI. The user is a driver which
mixes PIO and DMA.
I don't know if these have any similarity to your requirements though.
(I too need to come up with either a portable replacement of bus_to_virt
or with a fundamentally different implementation but haven't started my
project yet. This occurrence of bus_to_virt is in drivers/ieee1394/sbp2
but #ifdef'd out by default.)
--
Stefan Richter
-=====-=-==- -=== -==--
http://arcgraph.de/sr/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions for how to remove bus_to_virt()
2006-07-13 0:40 ` David Miller
2006-07-13 5:46 ` [openib-general] " Muli Ben-Yehuda
2006-07-13 7:45 ` Stefan Richter
@ 2006-07-13 16:02 ` Roland Dreier
2006-07-13 16:37 ` Ralph Campbell
2 siblings, 1 reply; 12+ messages in thread
From: Roland Dreier @ 2006-07-13 16:02 UTC (permalink / raw)
To: David Miller; +Cc: ralphc, rolandd, openib-general, linux-kernel
> > A cleaner solution would be to make the dma_ API really use the device
> > it's passed anyway, and allow drivers to override the standard PCI
> > stuff nicely. But that would be major surgery, I guess.
> Clean but expensive, you should not force the rest of the kernel
> to eat the cost of something you want to do when it's totally
> unnecessary for most other users.
OK, fair enough.
> For example, x86 never needs to do anything other than a direct
> virt_to_phys translation to produce a DMA address, no matter what
> bus the device is on. It's a single simple integer adjustment
> that can be done inline in about 2 or 3 instructions at most.
<pedantic>Except x86 needs to handle systems with IOMMUs now...</pedantic>
> If you need device level DMA mapping semantics, create them for your
> device type. This is what USB does, btw.
Makes sense -- Ralph, I would suggest looking at USB as a model.
- R.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions for how to remove bus_to_virt()
2006-07-13 16:02 ` Roland Dreier
@ 2006-07-13 16:37 ` Ralph Campbell
0 siblings, 0 replies; 12+ messages in thread
From: Ralph Campbell @ 2006-07-13 16:37 UTC (permalink / raw)
To: Roland Dreier; +Cc: David Miller, rolandd, openib-general, linux-kernel
Thanks to all for the pointers and suggestions.
It will probably take me a while to follow up on these
and make another proposal.
On Thu, 2006-07-13 at 09:02 -0700, Roland Dreier wrote:
> > > A cleaner solution would be to make the dma_ API really use the device
> > > it's passed anyway, and allow drivers to override the standard PCI
> > > stuff nicely. But that would be major surgery, I guess.
>
> > Clean but expensive, you should not force the rest of the kernel
> > to eat the cost of something you want to do when it's totally
> > unnecessary for most other users.
>
> OK, fair enough.
>
> > For example, x86 never needs to do anything other than a direct
> > virt_to_phys translation to produce a DMA address, no matter what
> > bus the device is on. It's a single simple integer adjustment
> > that can be done inline in about 2 or 3 instructions at most.
>
> <pedantic>Except x86 needs to handle systems with IOMMUs now...</pedantic>
>
> > If you need device level DMA mapping semantics, create them for your
> > device type. This is what USB does, btw.
>
> Makes sense -- Ralph, I would suggest looking at USB as a model.
>
> - R.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [openib-general] Suggestions for how to remove bus_to_virt()
2006-07-13 5:46 ` [openib-general] " Muli Ben-Yehuda
@ 2006-07-14 22:27 ` Ralph Campbell
2006-07-14 22:35 ` David Miller
0 siblings, 1 reply; 12+ messages in thread
From: Ralph Campbell @ 2006-07-14 22:27 UTC (permalink / raw)
To: Muli Ben-Yehuda; +Cc: David Miller, rdreier, linux-kernel, openib-general
On Thu, 2006-07-13 at 08:46 +0300, Muli Ben-Yehuda wrote:
> On Wed, Jul 12, 2006 at 05:40:13PM -0700, David Miller wrote:
> > From: Roland Dreier <rdreier@cisco.com>
> > Date: Wed, 12 Jul 2006 17:11:26 -0700
> >
> > > A cleaner solution would be to make the dma_ API really use the device
> > > it's passed anyway, and allow drivers to override the standard PCI
> > > stuff nicely. But that would be major surgery, I guess.
> >
> > Clean but expensive, you should not force the rest of the kernel
> > to eat the cost of something you want to do when it's totally
> > unnecessary for most other users.
> >
> > For example, x86 never needs to do anything other than a direct
> > virt_to_phys translation to produce a DMA address, no matter what
> > bus the device is on. It's a single simple integer adjustment
> > that can be done inline in about 2 or 3 instructions at most.
>
> It's possible that even x86 will support multiple IOMMUs in the future
> - for example, the Calgary IOMMU support we recently added to x86-64
> could be modified to work on plain x86 as well.
>
> I like the idea of a per-device DMA-API implementation, but only if it
> can be done in a way that is zero cost to the majority of the users of
> the API. We already have dynamic dma_ops on x86-64 to support nommu,
> swiotlb, gart and Calgary cleanly, extending it to use a per-device
> dma-ops isn't too difficult.
>
> Cheers,
> Muli
A per-device DMA-API would solve my problem.
It would be a fairly invasive changeset though.
The basic idea would be to add a struct dma_mapping_ops *
to struct device and change all the inline dma_* routines
to something like:
static inline dma_addr_t
dma_map_single(struct device *hwdev, void *ptr, size_t size,
int direction)
{
BUG_ON(!valid_dma_direction(direction));
return hwdev->dma_ops ?
hwdev->dma_ops->map_single(hwdev, ptr, size, direction) :
dma_ops->map_single(hwdev, ptr, size, direction);
}
Note that the current design only supports one IOMMU type in a system.
This could support multiple IOMMU types at the same time.
Another possibility is to only do this for the infiniband subsystem.
The idea would be to replace calls to dma_* with ib_dma_* which
would be defined as above.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [openib-general] Suggestions for how to remove bus_to_virt()
2006-07-14 22:27 ` Ralph Campbell
@ 2006-07-14 22:35 ` David Miller
2006-07-14 23:45 ` Ralph Campbell
0 siblings, 1 reply; 12+ messages in thread
From: David Miller @ 2006-07-14 22:35 UTC (permalink / raw)
To: ralphc; +Cc: muli, rdreier, linux-kernel, openib-general
From: Ralph Campbell <ralphc@pathscale.com>
Date: Fri, 14 Jul 2006 15:27:07 -0700
> Note that the current design only supports one IOMMU type in a system.
> This could support multiple IOMMU types at the same time.
This is not true, the framework allows multiply such types
and in fact Sparc64 takes advantage of this. We have about
4 or 5 different PCI controllers, and the IOMMUs are slightly
different in each.
Even with the standard PCI DMA mapping calls, we can gather the
platform private information necessary to program the IOMMU
appropriately for a given chipset.
The dma_mapping_ops idea will never get accepted by folks like Linus,
for reasons I've outlined in previous emails in this thread. So, it's
best to look elsewhere for solutions to your problem, such as the
ideas used by the USB and IEE1394 device layers.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [openib-general] Suggestions for how to remove bus_to_virt()
2006-07-14 22:35 ` David Miller
@ 2006-07-14 23:45 ` Ralph Campbell
2006-07-15 13:42 ` Stefan Richter
0 siblings, 1 reply; 12+ messages in thread
From: Ralph Campbell @ 2006-07-14 23:45 UTC (permalink / raw)
To: David Miller; +Cc: muli, rdreier, linux-kernel, openib-general
On Fri, 2006-07-14 at 15:35 -0700, David Miller wrote:
> From: Ralph Campbell <ralphc@pathscale.com>
> Date: Fri, 14 Jul 2006 15:27:07 -0700
>
> > Note that the current design only supports one IOMMU type in a system.
> > This could support multiple IOMMU types at the same time.
>
> This is not true, the framework allows multiply such types
> and in fact Sparc64 takes advantage of this. We have about
> 4 or 5 different PCI controllers, and the IOMMUs are slightly
> different in each.
I see. It looks like dma_map_single() is an inline call to
pci_map_single() which is a function call that can then
look at the device and tell what IOMMU function to call.
> Even with the standard PCI DMA mapping calls, we can gather the
> platform private information necessary to program the IOMMU
> appropriately for a given chipset.
>
> The dma_mapping_ops idea will never get accepted by folks like Linus,
> for reasons I've outlined in previous emails in this thread. So, it's
> best to look elsewhere for solutions to your problem, such as the
> ideas used by the USB and IEE1394 device layers.
The USB code won't work in my case because the USB system is
the one doing the memory allocation and IOMMU setup so it
can remember the kernel virtual address or physical pages used
to create the mapping.
In my case, the infiniband (SRP) code is doing the mapping and
only passing the dma_addr_t to the device driver at which point
I have no way to convert it back to a kernel virtual address.
I need to either change the IB device API to include mapping functions
or intercept the dma_* functions so I can save the inputs.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [openib-general] Suggestions for how to remove bus_to_virt()
2006-07-14 23:45 ` Ralph Campbell
@ 2006-07-15 13:42 ` Stefan Richter
0 siblings, 0 replies; 12+ messages in thread
From: Stefan Richter @ 2006-07-15 13:42 UTC (permalink / raw)
To: linux-kernel; +Cc: Ralph Campbell, David Miller, muli, rdreier, openib-general
Ralph Campbell wrote:
> On Fri, 2006-07-14 at 15:35 -0700, David Miller wrote:
...
>> The dma_mapping_ops idea will never get accepted by folks like Linus,
>> for reasons I've outlined in previous emails in this thread. So, it's
>> best to look elsewhere for solutions to your problem, such as the
>> ideas used by the USB and IEE1394 device layers.
>
> The USB code won't work in my case because the USB system is
> the one doing the memory allocation and IOMMU setup so it
> can remember the kernel virtual address or physical pages used
> to create the mapping.
Side note: The same is true with the DMA stuff in the ieee1394
subsystem. And the SCSI subsystem doesn't allocate (all) buffers but
leaves DMA mapping and unmapping to the low-level drivers --- i.e. Ralph
can't rip bus_to_virt replacements from there either, because:
> In my case, the infiniband (SRP) code is doing the mapping and
> only passing the dma_addr_t to the device driver at which point
> I have no way to convert it back to a kernel virtual address.
> I need to either change the IB device API to include mapping functions
> or intercept the dma_* functions so I can save the inputs.
On the other hand, ieee1394/dma is the rather obvious example of a
generic layer which keeps book of virtual address and bus address of
mapped memory regions, for above or below layers to use as they need.
Ralph, do you think you can arrange your required API change as a pure
_extension_ of the IB API? I.e. add fields to data structs or add fields
to callback templates or add calls into the SRP layer... (I haven't
bothered to look at the API yet.)
--
Stefan Richter
-=====-=-==- -=== -====
http://arcgraph.de/sr/
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2006-07-15 13:45 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-07-12 23:29 Suggestions for how to remove bus_to_virt() Ralph Campbell
2006-07-12 23:40 ` David Miller
2006-07-13 0:11 ` Roland Dreier
2006-07-13 0:40 ` David Miller
2006-07-13 5:46 ` [openib-general] " Muli Ben-Yehuda
2006-07-14 22:27 ` Ralph Campbell
2006-07-14 22:35 ` David Miller
2006-07-14 23:45 ` Ralph Campbell
2006-07-15 13:42 ` Stefan Richter
2006-07-13 7:45 ` Stefan Richter
2006-07-13 16:02 ` Roland Dreier
2006-07-13 16:37 ` Ralph Campbell
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox