* [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask
From: Nikita Yushchenko @ 2017-01-09 20:34 UTC

[CCing NVMe maintainers since we are discussing issues in that driver]

>> With my patch applied and thus 32bit dma_mask set for NVMe device, I do
>> see high addresses passed to dma_map_*() routines and handled by
>> swiotlb. Thus your statement that behavior "succeed 64bit dma_set_mask()
>> operation but silently replace mask behind the scene" is required for
>> swiotlb to be used, does not match reality.
>
> See my point about drivers that don't implement bounce buffering.
> Apparently NVMe is one of them, unlike the SATA/SCSI/MMC storage
> drivers that do their own thing.

I believe the bounce buffering code you refer to is not in SATA/SCSI/MMC
but in block layer, in particular it should be controlled by
blk_queue_bounce_limit(). [Yes there is CONFIG_MMC_BLOCK_BOUNCE but it
is something completely different, namely it is for request merging for
hw not supporting scatter-gather]. And NVMe also uses block layer and
thus should get same support.

But blk_queue_bounce_limit() is somewhat broken, it has very strange code
under #if BITS_PER_LONG == 64 that makes setting max_addr to 0xffffffff
not working if max_low_pfn is above 4G.

Maybe fixing that, together with making NVMe use this API, could stop it
from issuing dma_map()s of addresses beyond mask.

> What I think happened here in chronological order is:
>
> - In the old days, 64-bit architectures tended to use an IOMMU
>   all the time to work around 32-bit limitations on DMA masters
> - Some architectures had no IOMMU that fully solved this and the
>   dma-mapping API required drivers to set the right mask and check
>   the return code. If this failed, the driver needed to use its
>   own bounce buffers as network and scsi do. See also the
>   grossly misnamed "PCI_DMA_BUS_IS_PHYS" macro.
> - As we never had support for bounce buffers in all drivers, and
>   early 64-bit Intel machines had no IOMMU, the swiotlb code was
>   introduced as a workaround, so we can use the IOMMU case without
>   driver specific bounce buffers everywhere
> - As most of the important 64-bit architectures (x86, arm64, powerpc)
>   now always have either IOMMU or swiotlb enabled, drivers like
>   NVMe started relying on it, and no longer handle a dma_set_mask
>   failure properly.

... and architectures started to add to this breakage, not handling
dma_set_mask() as documented.

As for PCI_DMA_BUS_IS_PHYS - ironically, what all current usages of this
macro in the kernel do is - *disable* bounce buffers in block layer if
PCI_DMA_BUS_IS_PHYS is zero. Defining it to zero (as arm64 currently
does) on system with memory above 4G makes all block drivers depend on
swiotlb (or iommu). Affected drivers are SCSI and IDE.

> We may need to audit how drivers typically handle dma_set_mask()
> failure. The NVMe driver in its current state will probably cause
> silent data corruption when used on a 64-bit architecture that has
> a 32-bit bus but neither swiotlb nor iommu enabled at runtime.

With current code NVMe causes system memory breakage even if swiotlb is
there - because its dma_set_mask_and_coherent(DMA_BIT_MASK(64)) call has
the effect of silently disabling swiotlb.

> I would argue that the driver should be fixed to either refuse
> working in that configuration to avoid data corruption, or that
> it should implement bounce buffering like SCSI does.

Difference from "SCSI" (actually - from block drivers that work) is in
that dma_set_mask_and_coherent(DMA_BIT_MASK(64)) call: driver that does
not do it works, driver that does it fails.

Per documentation, driver *should* do it if its hardware supports 64-bit
dma, and platform *should* either fail this call, or ensure that 64-bit
addresses can be dma_map()ed successfully.

So what we have on arm64 is - drivers that follow documented procedure
fail, drivers that don't follow it work. That's nonsense.

> If we make it
> simply not work, then your suggestion of making dma_set_mask()
> fail will break your system in a different way.

Proper fix should fix *both* architecture and NVMe.

- architecture should stop breaking 64-bit DMA when driver attempts to
  set 64-bit dma mask,

- NVMe should issue proper blk_queue_bounce_limit() call based on what
  is actually set mask,

- and blk_queue_bounce_limit() should also be fixed to actually set
  0xffffffff limit, instead of replacing it with (max_low_pfn <<
  PAGE_SHIFT) as it does now.

>> Still current code does not work, thus fix is needed.
>>
>> Perhaps need to introduce some generic API to "allocate memory best
>> suited for DMA to particular device", and fix allocation points (in
>> drivers, filesystems, etc) to use it. Such an API could try to allocate
>> area that can be DMAed by hardware, and fallback to other memory that
>> can be used via swiotlb or other bounce buffer implementation.
>
> The DMA mapping API is meant to do this, but we can definitely improve
> it or clarify some of the rules.

DMA mapping API can't help here, it's about mapping, not about allocation.

What I mean is some API to allocate memory for use with streaming DMA in
such way that bounce buffers won't be needed. There are many cases when
at buffer allocation time, it is already known that buffer will be used
for DMA with particular device. Bounce buffers will still be needed in
cases when no such information is available at allocation time, or when
there is no directly-DMAable memory available at allocation time.

>> But for now, have to stay with dma masks. Will follow-up with a patch
>> based on yours but with coherent mask handling added.
>
> Ok.

Already posted. Can we have that merged? At least it will make things
stop breaking memory and start working.

Nikita

^ permalink raw reply [flat|nested] 14+ messages in thread
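The mask semantics argued over in the message above can be modelled in
user-space C. This is a minimal sketch under stated assumptions, not kernel
code: DMA_BIT_MASK() follows the kernel macro of the same name, while
fits_mask(), needs_bounce() and silently_broken() are hypothetical helpers
introduced here for illustration. The model assumes that a swiotlb-style
layer bounces a buffer only when it lies outside the mask the driver has
set, which is the behaviour described in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: true if the whole buffer [addr, addr + size) is
 * addressable under the given mask. Written to avoid 64-bit overflow. */
static bool fits_mask(uint64_t addr, uint64_t size, uint64_t mask)
{
    return size != 0 && addr <= mask && size - 1 <= mask - addr;
}

/* Hypothetical model of the bounce decision: bounce only when the buffer
 * lies outside the mask that the driver asked for. */
static bool needs_bounce(uint64_t addr, uint64_t size, uint64_t dev_mask)
{
    return !fits_mask(addr, size, dev_mask);
}

/* True if the buffer is NOT bounced and yet cannot be reached over the
 * bus, i.e. the mask was accepted but the DMA goes to the wrong place. */
static bool silently_broken(uint64_t addr, uint64_t size,
                            uint64_t dev_mask, uint64_t bus_mask)
{
    return !needs_bounce(addr, size, dev_mask) &&
           !fits_mask(addr, size, bus_mask);
}
```

With a buffer at the 4G boundary and a bus limited to 32 bits,
needs_bounce() is true for a 32-bit device mask and false for a 64-bit one,
so silently_broken() reports the case complained about above: the 64-bit
mask was accepted, nothing is bounced, and the bus cannot reach the buffer.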
* [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask
From: Christoph Hellwig @ 2017-01-09 20:57 UTC

On Mon, Jan 09, 2017 at 11:34:55PM +0300, Nikita Yushchenko wrote:
> I believe the bounce buffering code you refer to is not in SATA/SCSI/MMC
> but in block layer, in particular it should be controlled by
> blk_queue_bounce_limit(). [Yes there is CONFIG_MMC_BLOCK_BOUNCE but it
> is something completely different, namely it is for request merging for
> hw not supporting scatter-gather]. And NVMe also uses block layer and
> thus should get same support.

NVMe shouldn't have to call blk_queue_bounce_limit -
blk_queue_bounce_limit is to set the DMA addressing limit of the device.
NVMe devices must support unlimited 64-bit addressing and thus calling
blk_queue_bounce_limit from NVMe does not make sense.

That being said currently the default for a queue without a call
to blk_queue_make_request which does the wrong thing on highmem setups,
so we should fix it. In fact BLK_BOUNCE_HIGH as-is doesn't really
make much sense these days as no driver should ever dereference
pages passed to it directly.

> Maybe fixing that, together with making NVMe use this API, could stop it
> from issuing dma_map()s of addresses beyond mask.

NVMe should never bounce, the fact that it currently possibly does
for highmem pages is a bug.

> As for PCI_DMA_BUS_IS_PHYS - ironically, what all current usages of this
> macro in the kernel do is - *disable* bounce buffers in block layer if
> PCI_DMA_BUS_IS_PHYS is zero.

That's not ironic but the whole point of the macro (horrible name and
the fact that it should be a dma_ops setting aside).

> - architecture should stop breaking 64-bit DMA when driver attempts to
> set 64-bit dma mask,
>
> - NVMe should issue proper blk_queue_bounce_limit() call based on what
> is actually set mask,

Or even better remove the call to dma_set_mask_and_coherent with
DMA_BIT_MASK(32). NVMe is designed around having proper 64-bit DMA
addressing, there is no point in trying to pretend it works without that.

> - and blk_queue_bounce_limit() should also be fixed to actually set
> 0xffffffff limit, instead of replacing it with (max_low_pfn <<
> PAGE_SHIFT) as it does now.

We need to kill off BLK_BOUNCE_HIGH, it just doesn't make sense to
mix the highmem aspect with the addressing limits. In fact the whole
block bouncing scheme doesn't make much sense at all these days, we
should rely on swiotlb instead.

> What I mean is some API to allocate memory for use with streaming DMA in
> such way that bounce buffers won't be needed. There are many cases when
> at buffer allocation time, it is already known that buffer will be used
> for DMA with particular device. Bounce buffers will still be needed in
> cases when no such information is available at allocation time, or when
> there is no directly-DMAable memory available at allocation time.

For block I/O that is never the case.
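The blk_queue_bounce_limit() complaint quoted above can be illustrated with
a small user-space model. This is only a sketch of the behaviour as it is
described in this thread, not a copy of the kernel function: the names
bounce_pfn_current() and bounce_pfn_proposed() are hypothetical, and
PAGE_SHIFT is assumed to be 12 (4 KiB pages).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Behaviour described in the thread for 64-bit kernels: the requested
 * limit is silently raised to max_low_pfn when that is the larger one. */
static uint64_t bounce_pfn_current(uint64_t max_addr, uint64_t max_low_pfn)
{
    uint64_t b_pfn = max_addr >> PAGE_SHIFT;

    return b_pfn > max_low_pfn ? b_pfn : max_low_pfn;
}

/* Behaviour asked for in the thread: honour the limit the caller passed. */
static uint64_t bounce_pfn_proposed(uint64_t max_addr, uint64_t max_low_pfn)
{
    (void)max_low_pfn; /* deliberately ignored in this model */
    return max_addr >> PAGE_SHIFT;
}
```

For a machine whose low memory ends at 8 GiB (max_low_pfn of 0x200000 with
4 KiB pages), a requested limit of 0xffffffff yields a bounce pfn of
0x200000 in the first model and 0xfffff in the second, so only the second
would cause pages above 4G to be bounced.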
[parent not found: <e084dbad-29ab-25bd-5e17-da0fcd92f7ac@cogentembedded.com>]
* NVMe vs DMA addressing limitations
From: Christoph Hellwig @ 2017-01-10 7:07 UTC

On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
> I'm now working with HW that:
> - is in no way "low end" or "obsolete", it has 4G of RAM and 8 CPU cores,
>   and is being manufactured and developed,
> - has 75% of its RAM located beyond first 4G of address space,
> - can't physically handle incoming PCIe transactions addressed to memory
>   beyond 4G.

It might not be low end or obsolete, but it's absolutely braindead.
Your I/O performance will suffer badly for the life of the platform
because someone tries to save 2 cents, and there is not much we can do
about it.

> (1) it constantly runs out of swiotlb space, logs are full of warnings
>     despite of rate limiting,

> Per my current understanding, blk-level bounce buffering will at least
> help with (1) - if done properly it will allocate bounce buffers within
> entire memory below 4G, not within dedicated swiotlb space (that is
> small and enlarging it makes memory permanently unavailable for other
> use). This looks simple and safe (in sense of not anyhow breaking
> unrelated use cases).

Yes. Although there is absolutely no reason why swiotlb could not
do the same.

> (2) it runs far suboptimal due to bounce-buffering almost all i/o,
>     despite of lots of free memory in area where direct DMA is possible.

> Addressing (2) looks much more difficult because different memory
> allocation policy is required for that.

It's basically not possible. Every piece of memory in a Linux
kernel is a possible source of I/O, and depending on the workload
type it might even be the prime source of I/O.

> > NVMe should never bounce, the fact that it currently possibly does
> > for highmem pages is a bug.
>
> The entire topic is absolutely not related to highmem (i.e. memory not
> directly addressable by 32-bit kernel).

I did not say this affects you, but thanks to your mail I noticed that
NVMe has a suboptimal setting there. Also note that highmem does not
have to imply a 32-bit kernel, just physical memory that is not in the
kernel mapping.

> What we are discussing is hw-originated restriction on where DMA is
> possible.

Yes, where hw means the SOC, and not the actual I/O device, which is an
important distinction.

> > Or even better remove the call to dma_set_mask_and_coherent with
> > DMA_BIT_MASK(32). NVMe is designed around having proper 64-bit DMA
> > addressing, there is no point in trying to pretend it works without that
>
> Are you claiming that NVMe driver in mainline is intentionally designed
> to not work on HW that can't do DMA to entire 64-bit space?

It is not intended to handle the case where the SOC / chipset
can't handle DMA to all physical memory, yes.

> Such setups do exist and there is interest to make them working.

Sure, but it's not the job of the NVMe driver to work around such a broken
system. It's something your architecture code needs to do, maybe with
a bit of core kernel support.

> Quite a few pages used for block I/O are allocated by filemap code - and
> at allocation point it is known what inode page is being allocated for.
> If this inode is from filesystem located on a known device with known
> DMA limitations, this knowledge can be used to allocate page that can be
> DMAed directly.

But in other cases we might never DMA to it. Or we rarely DMA to it, say
for a machine running databases or qemu and using lots of direct I/O. Or
a storage target using its local alloc_pages buffers.

> Sure there are lots of cases when at allocation time there is no idea
> what device will run DMA on page being allocated, or perhaps page is
> going to be shared, or whatever. Such cases unavoidably require bounce
> buffers if page ends to be used with device with DMA limitations. But
> still there are cases when better allocation can remove need for bounce
> buffers - without any hurt for other cases.

It takes your max 1GB DMA addressable memory away from other uses,
and introduces the crazy highmem VM tuning issues we had with big
32-bit x86 systems in the past.
* NVMe vs DMA addressing limitations
From: Nikita Yushchenko @ 2017-01-10 7:31 UTC

Christoph, thanks for clear input.

Arnd, I think that given this discussion, best short-term solution is
indeed the patch I've submitted yesterday. That is, your version +
coherent mask support. With that, set_dma_mask(DMA_BIT_MASK(64)) will
succeed and hardware will work with swiotlb.

Possible next step is to teach swiotlb to dynamically allocate bounce
buffers within entire arm64's ZONE_DMA.

Also there is some hope that R-Car *can* iommu-translate addresses that
PCIe module issues to system bus. Although previous attempts to make
that working failed. Additional research is needed here.

Nikita
* NVMe vs DMA addressing limitations
From: Arnd Bergmann @ 2017-01-10 11:01 UTC

On Tuesday, January 10, 2017 10:31:47 AM CET Nikita Yushchenko wrote:
> Christoph, thanks for clear input.
>
> Arnd, I think that given this discussion, best short-term solution is
> indeed the patch I've submitted yesterday. That is, your version +
> coherent mask support. With that, set_dma_mask(DMA_BIT_MASK(64)) will
> succeed and hardware will work with swiotlb.

Ok, good.

> Possible next step is to teach swiotlb to dynamically allocate bounce
> buffers within entire arm64's ZONE_DMA.

That seems reasonable, yes. We probably have to do both, as there are
cases where a device has dma_mask smaller than ZONE_DMA but the swiotlb
bounce area is low enough to work anyway.

Another workaround we might need is to limit amount of concurrent DMA
in the NVMe driver based on some platform quirk. The way that NVMe works,
it can have very large amounts of data that is concurrently mapped into
the device. SWIOTLB is one case where this currently fails, but another
example would be old PowerPC servers that have a 256MB window of virtual
I/O addresses per VM guest in their IOMMU. Those will likely fail the same
way that yours does.

> Also there is some hope that R-Car *can* iommu-translate addresses that
> PCIe module issues to system bus. Although previous attempts to make
> that working failed. Additional research is needed here.

Does this IOMMU support remapping data within a virtual machine? I believe
there are some that only do one of the two -- either you can have guest
machines with DMA access to their low memory, or you can remap data on
the fly in the host.

Arnd
* NVMe vs DMA addressing limitations
From: Christoph Hellwig @ 2017-01-10 14:48 UTC

On Tue, Jan 10, 2017 at 12:01:05PM +0100, Arnd Bergmann wrote:
> Another workaround we might need is to limit amount of concurrent DMA
> in the NVMe driver based on some platform quirk. The way that NVMe works,
> it can have very large amounts of data that is concurrently mapped into
> the device.

That's not really just NVMe - other storage and network controllers also
can DMA map giant amounts of memory. There are a couple aspects to it:

 - dma coherent memory - right now NVMe doesn't use too much of it,
   but upcoming low-end NVMe controllers will soon start to require
   fairly large amounts of it for the host memory buffer feature that
   allows for DRAM-less controller designs. As an interesting quirk
   that is memory only used by the PCIe devices, and never accessed
   by the Linux host at all.

 - size vs number of the dynamic mapping. We probably want the dma_ops
   to specify a maximum mapping size for a given device. As long as we
   can make progress with a few mappings swiotlb / the iommu can just
   fail mapping and the driver will propagate that to the block layer
   that throttles I/O.
* NVMe vs DMA addressing limitations
From: Arnd Bergmann @ 2017-01-10 15:02 UTC

On Tuesday, January 10, 2017 3:48:39 PM CET Christoph Hellwig wrote:
> On Tue, Jan 10, 2017 at 12:01:05PM +0100, Arnd Bergmann wrote:
> > Another workaround we might need is to limit amount of concurrent DMA
> > in the NVMe driver based on some platform quirk. The way that NVMe works,
> > it can have very large amounts of data that is concurrently mapped into
> > the device.
>
> That's not really just NVMe - other storage and network controllers also
> can DMA map giant amounts of memory. There are a couple aspects to it:
>
>  - dma coherent memory - right now NVMe doesn't use too much of it,
>    but upcoming low-end NVMe controllers will soon start to require
>    fairly large amounts of it for the host memory buffer feature that
>    allows for DRAM-less controller designs. As an interesting quirk
>    that is memory only used by the PCIe devices, and never accessed
>    by the Linux host at all.

Right, that is going to become interesting, as some platforms are
very limited with their coherent allocations.

>  - size vs number of the dynamic mapping. We probably want the dma_ops
>    to specify a maximum mapping size for a given device. As long as we
>    can make progress with a few mappings swiotlb / the iommu can just
>    fail mapping and the driver will propagate that to the block layer
>    that throttles I/O.

Good idea.

Arnd
* NVMe vs DMA addressing limitations
From: Sagi Grimberg @ 2017-01-12 10:09 UTC

>> Another workaround we might need is to limit amount of concurrent DMA
>> in the NVMe driver based on some platform quirk. The way that NVMe works,
>> it can have very large amounts of data that is concurrently mapped into
>> the device.
>
> That's not really just NVMe - other storage and network controllers also
> can DMA map giant amounts of memory. There are a couple aspects to it:
>
>  - dma coherent memory - right now NVMe doesn't use too much of it,
>    but upcoming low-end NVMe controllers will soon start to require
>    fairly large amounts of it for the host memory buffer feature that
>    allows for DRAM-less controller designs. As an interesting quirk
>    that is memory only used by the PCIe devices, and never accessed
>    by the Linux host at all.

Would it make sense to convert the nvme driver to use normal allocations
and use the DMA streaming APIs (dma_sync_single_for_[cpu|device]) for
both queues and future HMB?

>  - size vs number of the dynamic mapping. We probably want the dma_ops
>    to specify a maximum mapping size for a given device. As long as we
>    can make progress with a few mappings swiotlb / the iommu can just
>    fail mapping and the driver will propagate that to the block layer
>    that throttles I/O.

Isn't max mapping size per device too restrictive? It is possible that
not all devices possess active mappings concurrently.
* NVMe vs DMA addressing limitations
From: Arnd Bergmann @ 2017-01-12 11:56 UTC

On Thursday, January 12, 2017 12:09:11 PM CET Sagi Grimberg wrote:
> >> Another workaround we might need is to limit amount of concurrent DMA
> >> in the NVMe driver based on some platform quirk. The way that NVMe works,
> >> it can have very large amounts of data that is concurrently mapped into
> >> the device.
> >
> > That's not really just NVMe - other storage and network controllers also
> > can DMA map giant amounts of memory. There are a couple aspects to it:
> >
> >  - dma coherent memory - right now NVMe doesn't use too much of it,
> >    but upcoming low-end NVMe controllers will soon start to require
> >    fairly large amounts of it for the host memory buffer feature that
> >    allows for DRAM-less controller designs. As an interesting quirk
> >    that is memory only used by the PCIe devices, and never accessed
> >    by the Linux host at all.
>
> Would it make sense to convert the nvme driver to use normal allocations
> and use the DMA streaming APIs (dma_sync_single_for_[cpu|device]) for
> both queues and future HMB?

That is an interesting question: We actually have the
"DMA_ATTR_NO_KERNEL_MAPPING" for this case, and ARM implements
it in the coherent interface, so that might be a good fit.

Implementing it in the streaming API makes no sense since we
already have a kernel mapping here, but using a normal allocation
(possibly with DMA_ATTR_NON_CONSISTENT or DMA_ATTR_SKIP_CPU_SYNC,
need to check) might help on other architectures that have
limited amounts of coherent memory and no CMA.

Another benefit of the coherent API for this kind of buffer is
that we can use CMA where available to get a large consecutive
chunk of RAM on architectures without an IOMMU when normal
memory is no longer available because of fragmentation.

Arnd
* NVMe vs DMA addressing limitations
From: Christoph Hellwig @ 2017-01-12 13:07 UTC

On Thu, Jan 12, 2017 at 12:56:07PM +0100, Arnd Bergmann wrote:
> That is an interesting question: We actually have the
> "DMA_ATTR_NO_KERNEL_MAPPING" for this case, and ARM implements
> it in the coherent interface, so that might be a good fit.

Yes, my WIP HMB patch uses DMA_ATTR_NO_KERNEL_MAPPING, although I'm
working on x86 at the moment where it's a no-op.

> Implementing it in the streaming API makes no sense since we
> already have a kernel mapping here, but using a normal allocation
> (possibly with DMA_ATTR_NON_CONSISTENT or DMA_ATTR_SKIP_CPU_SYNC,
> need to check) might help on other architectures that have
> limited amounts of coherent memory and no CMA.

Thought about that - but in the end DMA_ATTR_NO_KERNEL_MAPPING implies
those, so instead of using lots of flags in driver I'd rather fix
up more dma_ops implementations to take advantage of
DMA_ATTR_NO_KERNEL_MAPPING.
* NVMe vs DMA addressing limitations
From: Arnd Bergmann @ 2017-01-10 10:54 UTC

On Tuesday, January 10, 2017 8:07:20 AM CET Christoph Hellwig wrote:
> On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
> > I'm now working with HW that:
> > - is in no way "low end" or "obsolete", it has 4G of RAM and 8 CPU cores,
> >   and is being manufactured and developed,
> > - has 75% of its RAM located beyond first 4G of address space,
> > - can't physically handle incoming PCIe transactions addressed to memory
> >   beyond 4G.
>
> It might not be low end or obsolete, but it's absolutely braindead.
> Your I/O performance will suffer badly for the life of the platform
> because someone tries to save 2 cents, and there is not much we can do
> about it.

Unfortunately it is a common problem for arm64 chips that were designed
by taking a 32-bit SoC and replacing the CPU core. The swiotlb is the
right workaround for this, and I think we all agree that we should
just make it work correctly.

Arnd
* [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask
  2017-01-09 20:57 ` Christoph Hellwig
       [not found]   ` <e084dbad-29ab-25bd-5e17-da0fcd92f7ac@cogentembedded.com>
@ 2017-01-10 10:47   ` Arnd Bergmann
  2017-01-10 14:44     ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Arnd Bergmann @ 2017-01-10 10:47 UTC

On Monday, January 9, 2017 9:57:46 PM CET Christoph Hellwig wrote:
> > - architecture should stop breaking 64-bit DMA when driver attempts to
> >   set 64-bit dma mask,
> >
> > - NVMe should issue proper blk_queue_bounce_limit() call based on the
> >   mask that is actually set,
>
> Or even better remove the call to dma_set_mask_and_coherent with
> DMA_BIT_MASK(32). NVMe is designed around having proper 64-bit DMA
> addressing, there is no point in trying to pretend it works without that

Agreed, let's just fail the probe() if DMA_BIT_MASK(64) fails, and
have swiotlb work around machines that for some reason need bounce
buffers.

> > - and blk_queue_bounce_limit() should also be fixed to actually set
> >   0xffffffff limit, instead of replacing it with (max_low_pfn <<
> >   PAGE_SHIFT) as it does now.
>
> We need to kill off BLK_BOUNCE_HIGH, it just doesn't make sense to
> mix the highmem aspect with the addressing limits. In fact the whole
> block bouncing scheme doesn't make much sense at all these days, we
> should rely on swiotlb instead.

If we do this, we should probably have another look at the respective
NETIF_F_HIGHDMA support in the network stack, which does the same thing
and mixes up highmem on 32-bit architectures with the DMA address limit.
(side note: there are actually cases in which you have a 31-bit DMA
mask but 3 GB of lowmem using CONFIG_VMSPLIT_1G, so BLK_BOUNCE_HIGH
and !NETIF_F_HIGHDMA are both missing the limit, causing data corruption
without swiotlb).

Before we rely too much on swiotlb, we may also need to consider which
architectures today rely on bouncing in blk and network.
I see that we have CONFIG_ARCH_PHYS_ADDR_T_64BIT on a couple of
32-bit architectures without swiotlb (arc, arm, some mips32), and
there are several 64-bit architectures that do not have swiotlb
(alpha, parisc, s390, sparc). I believe that alpha, s390 and sparc
always use some form of IOMMU, but the other four apparently don't,
so we would need to add swiotlb support there to remove all the
bounce buffering in network and block layers.

	Arnd
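[Editorial sketch, not part of the original message.] The
blk_queue_bounce_limit() complaint quoted above can be illustrated with
a small userspace model. It is not the kernel source: both function
names are hypothetical, PAGE_SHIFT is assumed to be 12 (4 KiB pages),
and the current behaviour is modelled as taking the larger of the
requested limit and max_low_pfn, which is an assumption based on the
description in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

static_assert(PAGE_SHIFT > 0 && PAGE_SHIFT < 32, "implausible page shift");

/*
 * Model of the behaviour criticised in the thread: a requested limit
 * below the top of low memory is silently raised to max_low_pfn.
 */
static uint64_t bounce_pfn_current(uint64_t max_addr, uint64_t max_low_pfn)
{
	uint64_t b_pfn = max_addr >> PAGE_SHIFT;

	return b_pfn > max_low_pfn ? b_pfn : max_low_pfn;
}

/* Model of the proposed fix: honour the requested limit unchanged. */
static uint64_t bounce_pfn_proposed(uint64_t max_addr, uint64_t max_low_pfn)
{
	(void)max_low_pfn;	/* deliberately ignored in this variant */
	return max_addr >> PAGE_SHIFT;
}

/* A page gets bounced when its frame number lies above the limit. */
static int page_is_bounced(uint64_t pfn, uint64_t bounce_pfn)
{
	return pfn > bounce_pfn;
}
```

On a machine whose low memory ends at 8G (max_low_pfn of 0x200000 with
4 KiB pages), a request for a 0xffffffff limit yields a limit of
0x200000 under the first model, so a page at 5G is not bounced even
though a 32-bit device cannot reach it. Under the second model the limit
stays at 0xfffff and the same page is bounced.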
* [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask
  2017-01-10 10:47 ` [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask Arnd Bergmann
@ 2017-01-10 14:44   ` Christoph Hellwig
  2017-01-10 15:00     ` Arnd Bergmann
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2017-01-10 14:44 UTC

On Tue, Jan 10, 2017 at 11:47:42AM +0100, Arnd Bergmann wrote:
> I see that we have CONFIG_ARCH_PHYS_ADDR_T_64BIT on a couple of
> 32-bit architectures without swiotlb (arc, arm, some mips32), and
> there are several 64-bit architectures that do not have swiotlb
> (alpha, parisc, s390, sparc). I believe that alpha, s390 and sparc
> always use some form of IOMMU, but the other four apparently don't,
> so we would need to add swiotlb support there to remove all the
> bounce buffering in network and block layers.

mips has lots of weird swiotlb wire-up in its board code (the swiotlb
arch glue really needs some major cleanup..), as does arm. Not
sure about the others.

Getting rid of highmem bouncing in the block layer will take some time
as various PIO-only drivers rely on it at the moment. These should
all be convertible to kmap that data, but it needs a careful audit
first. For 4.11 I plan to at least switch away from bouncing highmem by
default, though, and maybe also convert a few PIO drivers.
* [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask
  2017-01-10 14:44 ` Christoph Hellwig
@ 2017-01-10 15:00   ` Arnd Bergmann
  0 siblings, 0 replies; 14+ messages in thread
From: Arnd Bergmann @ 2017-01-10 15:00 UTC

On Tuesday, January 10, 2017 3:44:53 PM CET Christoph Hellwig wrote:
> On Tue, Jan 10, 2017 at 11:47:42AM +0100, Arnd Bergmann wrote:
> > I see that we have CONFIG_ARCH_PHYS_ADDR_T_64BIT on a couple of
> > 32-bit architectures without swiotlb (arc, arm, some mips32), and
> > there are several 64-bit architectures that do not have swiotlb
> > (alpha, parisc, s390, sparc). I believe that alpha, s390 and sparc
> > always use some form of IOMMU, but the other four apparently don't,
> > so we would need to add swiotlb support there to remove all the
> > bounce buffering in network and block layers.
>
> mips has lots of weird swiotlb wire-up in its board code (the swiotlb
> arch glue really needs some major cleanup..),

My reading of the MIPS code was that only the 64-bit platforms use it,
but there are a number of 32-bit platforms that have 64-bit physical
addresses and don't.

> as does arm. Not sure about the others.

32-bit ARM doesn't actually use SWIOTLB at all, despite selecting it
in Kconfig. I think Xen uses it for its own purposes, but nothing
else does.

Most ARM platforms can't actually have RAM beyond 4GB, and the ones
that do have it tend to also come with an IOMMU, but I remember
at least BCM53xx actually needing swiotlb on some chip revisions
that are widely used and that cannot DMA to the second memory bank
from PCI (!).

	Arnd
end of thread, other threads:[~2017-01-12 13:07 UTC | newest]
Thread overview: 14+ messages
[not found] <1483044304-2085-1-git-send-email-nikita.yoush@cogentembedded.com>
[not found] ` <2723285.JORgusvJv4@wuerfel>
[not found] ` <9a03c05d-ad4c-0547-d1fe-01edb8b082d6@cogentembedded.com>
[not found] ` <6374144.HVL0QxNJiT@wuerfel>
2017-01-09 20:34 ` [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask Nikita Yushchenko
2017-01-09 20:57 ` Christoph Hellwig
[not found] ` <e084dbad-29ab-25bd-5e17-da0fcd92f7ac@cogentembedded.com>
2017-01-10 7:07 ` NVMe vs DMA addressing limitations Christoph Hellwig
2017-01-10 7:31 ` Nikita Yushchenko
2017-01-10 11:01 ` Arnd Bergmann
2017-01-10 14:48 ` Christoph Hellwig
2017-01-10 15:02 ` Arnd Bergmann
2017-01-12 10:09 ` Sagi Grimberg
2017-01-12 11:56 ` Arnd Bergmann
2017-01-12 13:07 ` Christoph Hellwig
2017-01-10 10:54 ` Arnd Bergmann
2017-01-10 10:47 ` [PATCH 1/2] arm64: dma_mapping: allow PCI host driver to limit DMA mask Arnd Bergmann
2017-01-10 14:44 ` Christoph Hellwig
2017-01-10 15:00 ` Arnd Bergmann