* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping [not found] ` <562E5AE4.9070001@arm.com> @ 2015-10-30 1:17 ` Daniel Kurtz 2015-10-30 14:09 ` Joerg Roedel ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Daniel Kurtz @ 2015-10-30 1:17 UTC (permalink / raw) To: Robin Murphy, Pawel Osciak Cc: Yong Wu, Joerg Roedel, Will Deacon, Catalin Marinas, open list:IOMMU DRIVERS, linux-arm-kernel@lists.infradead.org, thunder.leizhen, Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding, Lin PoChun, Bobby Batacharia (via Google Docs), linux-media, Marek Szyprowski, Kyungmin Park +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the v4l2-contig's usage of the DMA API. Hi Robin, On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote: > On 26/10/15 13:44, Yong Wu wrote: >> >> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote: >> [...] >>> >>> +/* >>> + * The DMA API client is passing in a scatterlist which could describe >>> + * any old buffer layout, but the IOMMU API requires everything to be >>> + * aligned to IOMMU pages. Hence the need for this complicated bit of >>> + * impedance-matching, to be able to hand off a suitably-aligned list, >>> + * but still preserve the original offsets and sizes for the caller. >>> + */ >>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, >>> + int nents, int prot) >>> +{ >>> + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); >>> + struct iova_domain *iovad = domain->iova_cookie; >>> + struct iova *iova; >>> + struct scatterlist *s, *prev = NULL; >>> + dma_addr_t dma_addr; >>> + size_t iova_len = 0; >>> + int i; >>> + >>> + /* >>> + * Work out how much IOVA space we need, and align the segments >>> to >>> + * IOVA granules for the IOMMU driver to handle. With some clever >>> + * trickery we can modify the list in-place, but reversibly, by >>> + * hiding the original data in the as-yet-unused DMA fields. >>> + */ >>> + for_each_sg(sg, s, nents, i) { >>> + size_t s_offset = iova_offset(iovad, s->offset); >>> + size_t s_length = s->length; >>> + >>> + sg_dma_address(s) = s->offset; >>> + sg_dma_len(s) = s_length; >>> + s->offset -= s_offset; >>> + s_length = iova_align(iovad, s_length + s_offset); >>> + s->length = s_length; >>> + >>> + /* >>> + * The simple way to avoid the rare case of a segment >>> + * crossing the boundary mask is to pad the previous one >>> + * to end at a naturally-aligned IOVA for this one's >>> size, >>> + * at the cost of potentially over-allocating a little. >>> + */ >>> + if (prev) { >>> + size_t pad_len = roundup_pow_of_two(s_length); >>> + >>> + pad_len = (pad_len - iova_len) & (pad_len - 1); >>> + prev->length += pad_len; >> >> >> Hi Robin, >> While our v4l2 testing, It seems that we met a problem here. >> Here we update prev->length again, Do we need update >> sg_dma_len(prev) again too? >> >> Some function like vb2_dc_get_contiguous_size[1] always get >> sg_dma_len(s) to compare instead of s->length. so it may break >> unexpectedly while sg_dma_len(s) is not same with s->length. > > > This is just tweaking the faked-up length that we hand off to iommu_map_sg() > (see also the iova_align() above), to trick it into bumping this segment up > to a suitable starting IOVA. The real length at this point is stashed in > sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so > both will hold the same true length once we return to the caller. 
> > Yes, it does mean that if you have a list where the segment lengths are page > aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll > still end up with a gap between the second and third segments, but that's > fine because the DMA API offers no guarantees about what the resulting DMA > addresses will be (consider the no-IOMMU case where they would each just be > "mapped" to their physical address). If that breaks v4l, then it's probably > v4l's DMA API use that needs looking at (again). Hmm, I thought the DMA API maps a (possibly) non-contiguous set of memory pages into a contiguous block in device memory address space. This would allow passing a dma mapped buffer to device dma using just a device address and length. IIUC, the change above breaks this model by inserting gaps in how the buffer is mapped to device memory, such that the buffer is no longer contiguous in dma address space. Here is the code in question from drivers/media/v4l2-core/videobuf2-dma-contig.c : static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt) { struct scatterlist *s; dma_addr_t expected = sg_dma_address(sgt->sgl); unsigned int i; unsigned long size = 0; for_each_sg(sgt->sgl, s, sgt->nents, i) { if (sg_dma_address(s) != expected) break; expected = sg_dma_address(s) + sg_dma_len(s); size += sg_dma_len(s); } return size; } static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr, unsigned long size, enum dma_data_direction dma_dir) { struct vb2_dc_conf *conf = alloc_ctx; struct vb2_dc_buf *buf; struct frame_vector *vec; unsigned long offset; int n_pages, i; int ret = 0; struct sg_table *sgt; unsigned long contig_size; unsigned long dma_align = dma_get_cache_alignment(); DEFINE_DMA_ATTRS(attrs); dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs); buf = kzalloc(sizeof *buf, GFP_KERNEL); buf->dma_dir = dma_dir; offset = vaddr & ~PAGE_MASK; vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE); buf->vec = vec; n_pages = frame_vector_count(vec); sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages, offset, size, GFP_KERNEL); sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents, buf->dma_dir, &attrs); contig_size = vb2_dc_get_contiguous_size(sgt); if (contig_size < size) { <<<=== if the original buffer had sg entries that were not aligned on the "natural" alignment for their size, the new arm64 iommu core code inserts a 'gap' in the iommu mapping, which causes vb2_dc_get_contiguous_size() to exit early (and return a smaller size than expected). pr_err("contiguous mapping is too small %lu/%lu\n", contig_size, size); ret = -EFAULT; goto fail_map_sg; } So, is the videobuf2-dma-contig.c based on an incorrect assumption about how the DMA API is supposed to work? Is it even possible to map a "contiguous-in-iova-range" mapping for a buffer given as an sg_table with an arbitrary set of pages? Thanks for helping to move this forward. -Dan ^ permalink raw reply [flat|nested] 15+ messages in thread
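To make the {64k, 16k, 64k} case above concrete, the padding arithmetic from the quoted hunk can be run in isolation. The following is a minimal user-space sketch (assuming a 4 KiB IOMMU granule and page-aligned segments, with roundup_pow_of_two() open-coded so it builds outside the kernel); it is not kernel code, just the same calculation:

#include <stdio.h>
#include <stddef.h>

/* open-coded stand-in for the kernel's roundup_pow_of_two() */
static size_t roundup_pow_of_two(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	/* Robin's example: page-aligned segments of 64k, 16k, 64k */
	size_t lengths[] = { 64 << 10, 16 << 10, 64 << 10 };
	size_t iova_len = 0;
	int i;

	for (i = 0; i < 3; i++) {
		size_t s_length = lengths[i];

		if (i > 0) {
			/* same arithmetic as the patch hunk quoted above */
			size_t pad_len = roundup_pow_of_two(s_length);

			pad_len = (pad_len - iova_len) & (pad_len - 1);
			printf("padding added to segment %d: %zu KiB\n",
			       i - 1, pad_len >> 10);
			iova_len += pad_len;
		}
		iova_len += s_length;
	}
	printf("total IOVA allocation: %zu KiB\n", iova_len >> 10);
	/*
	 * Output: 0 KiB of padding after segment 0, 48 KiB after segment 1,
	 * 192 KiB allocated in total.  The third segment therefore starts
	 * after a 48 KiB gap, so vb2_dc_get_contiguous_size() stops counting
	 * at 80 KiB instead of the 144 KiB the caller expected.
	 */
	return 0;
}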
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-10-30 1:17 ` [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping Daniel Kurtz @ 2015-10-30 14:09 ` Joerg Roedel 2015-10-30 14:27 ` Robin Murphy 2015-11-17 12:02 ` Marek Szyprowski 2 siblings, 0 replies; 15+ messages in thread From: Joerg Roedel @ 2015-10-30 14:09 UTC (permalink / raw) To: Daniel Kurtz Cc: Robin Murphy, Pawel Osciak, Yong Wu, Will Deacon, Catalin Marinas, open list:IOMMU DRIVERS, linux-arm-kernel@lists.infradead.org, thunder.leizhen, Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding, Lin PoChun, Bobby Batacharia (via Google Docs), linux-media, Marek Szyprowski, Kyungmin Park On Fri, Oct 30, 2015 at 09:17:52AM +0800, Daniel Kurtz wrote: > Hmm, I thought the DMA API maps a (possibly) non-contiguous set of > memory pages into a contiguous block in device memory address space. > This would allow passing a dma mapped buffer to device dma using just > a device address and length. If you are speaking of the dma_map_sg interface, then there is absolutely no guarantee from the API side that the buffers you pass in will end up mapped contiguously. IOMMU drivers handle this differently, and when there is no IOMMU at all there is also no way to map these buffers together. > So, is the videobuf2-dma-contig.c based on an incorrect assumption > about how the DMA API is supposed to work? If it makes the above assumption, then yes. Joerg ^ permalink raw reply [flat|nested] 15+ messages in thread
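Joerg's point is easiest to see in the usage pattern dma_map_sg() is actually designed for: walk the segments that come back and hand each (address, length) pair to the device, making no assumption that neighbouring entries are contiguous. A kernel-style sketch, where hw_queue_segment() is a hypothetical stand-in for programming one hardware descriptor:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* hypothetical hook, standing in for programming one hardware descriptor */
extern void hw_queue_segment(dma_addr_t addr, unsigned int len);

static int queue_buffer(struct device *dev, struct scatterlist *sgl,
			int nents, enum dma_data_direction dir)
{
	struct scatterlist *s;
	int i, count;

	count = dma_map_sg(dev, sgl, nents, dir);
	if (!count)
		return -EIO;

	/*
	 * Walk what dma_map_sg() returned: 'count' may be smaller than
	 * 'nents' if entries were merged, and consecutive DMA addresses
	 * may or may not be contiguous - neither is assumed here.
	 */
	for_each_sg(sgl, s, count, i)
		hw_queue_segment(sg_dma_address(s), sg_dma_len(s));

	return 0;
}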
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-10-30 1:17 ` [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping Daniel Kurtz 2015-10-30 14:09 ` Joerg Roedel @ 2015-10-30 14:27 ` Robin Murphy 2015-11-02 13:11 ` Daniel Kurtz 2015-11-17 12:02 ` Marek Szyprowski 2 siblings, 1 reply; 15+ messages in thread From: Robin Murphy @ 2015-10-30 14:27 UTC (permalink / raw) To: Daniel Kurtz, Pawel Osciak Cc: Yong Wu, Joerg Roedel, Will Deacon, Catalin Marinas, open list:IOMMU DRIVERS, linux-arm-kernel@lists.infradead.org, thunder.leizhen, Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding, Lin PoChun, Bobby Batacharia (via Google Docs), linux-media, Marek Szyprowski, Kyungmin Park Hi Dan, On 30/10/15 01:17, Daniel Kurtz wrote: > +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the > v4l2-contig's usage of the DMA API. > > Hi Robin, > > On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote: >> On 26/10/15 13:44, Yong Wu wrote: >>> >>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote: >>> [...] >>>> >>>> +/* >>>> + * The DMA API client is passing in a scatterlist which could describe >>>> + * any old buffer layout, but the IOMMU API requires everything to be >>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of >>>> + * impedance-matching, to be able to hand off a suitably-aligned list, >>>> + * but still preserve the original offsets and sizes for the caller. >>>> + */ >>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, >>>> + int nents, int prot) >>>> +{ >>>> + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); >>>> + struct iova_domain *iovad = domain->iova_cookie; >>>> + struct iova *iova; >>>> + struct scatterlist *s, *prev = NULL; >>>> + dma_addr_t dma_addr; >>>> + size_t iova_len = 0; >>>> + int i; >>>> + >>>> + /* >>>> + * Work out how much IOVA space we need, and align the segments >>>> to >>>> + * IOVA granules for the IOMMU driver to handle. With some clever >>>> + * trickery we can modify the list in-place, but reversibly, by >>>> + * hiding the original data in the as-yet-unused DMA fields. >>>> + */ >>>> + for_each_sg(sg, s, nents, i) { >>>> + size_t s_offset = iova_offset(iovad, s->offset); >>>> + size_t s_length = s->length; >>>> + >>>> + sg_dma_address(s) = s->offset; >>>> + sg_dma_len(s) = s_length; >>>> + s->offset -= s_offset; >>>> + s_length = iova_align(iovad, s_length + s_offset); >>>> + s->length = s_length; >>>> + >>>> + /* >>>> + * The simple way to avoid the rare case of a segment >>>> + * crossing the boundary mask is to pad the previous one >>>> + * to end at a naturally-aligned IOVA for this one's >>>> size, >>>> + * at the cost of potentially over-allocating a little. >>>> + */ >>>> + if (prev) { >>>> + size_t pad_len = roundup_pow_of_two(s_length); >>>> + >>>> + pad_len = (pad_len - iova_len) & (pad_len - 1); >>>> + prev->length += pad_len; >>> >>> >>> Hi Robin, >>> While our v4l2 testing, It seems that we met a problem here. >>> Here we update prev->length again, Do we need update >>> sg_dma_len(prev) again too? >>> >>> Some function like vb2_dc_get_contiguous_size[1] always get >>> sg_dma_len(s) to compare instead of s->length. so it may break >>> unexpectedly while sg_dma_len(s) is not same with s->length. >> >> >> This is just tweaking the faked-up length that we hand off to iommu_map_sg() >> (see also the iova_align() above), to trick it into bumping this segment up >> to a suitable starting IOVA. 
The real length at this point is stashed in >> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so >> both will hold the same true length once we return to the caller. >> >> Yes, it does mean that if you have a list where the segment lengths are page >> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll >> still end up with a gap between the second and third segments, but that's >> fine because the DMA API offers no guarantees about what the resulting DMA >> addresses will be (consider the no-IOMMU case where they would each just be >> "mapped" to their physical address). If that breaks v4l, then it's probably >> v4l's DMA API use that needs looking at (again). > > Hmm, I thought the DMA API maps a (possibly) non-contiguous set of > memory pages into a contiguous block in device memory address space. > This would allow passing a dma mapped buffer to device dma using just > a device address and length. Not at all. The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail). Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned). > IIUC, the change above breaks this model by inserting gaps in how the > buffer is mapped to device memory, such that the buffer is no longer > contiguous in dma address space. Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0]. 
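For readers trying to follow the stash-and-restore trick being described, the undo step amounts to something like the sketch below: once the single IOVA allocation exists, walk the list again, put the caller's original offset/length back, and derive each segment's bus address from the padded lengths. This is a simplified illustration only, not the actual __finalise_sg() from the patch, and it ignores the per-device segment limits the real code also has to honour:

#include <linux/iova.h>
#include <linux/scatterlist.h>

/*
 * Simplified illustration of the restore step described above (NOT the
 * actual __finalise_sg() from the patch): undo the in-place trickery and
 * assign each segment its slice of the single IOVA allocation.
 */
static void restore_sg_sketch(struct iova_domain *iovad,
			      struct scatterlist *sg, int nents,
			      dma_addr_t iova)
{
	struct scatterlist *s;
	int i;

	for_each_sg(sg, s, nents, i) {
		unsigned int s_offset = sg_dma_address(s); /* stashed original offset */
		unsigned int s_length = sg_dma_len(s);     /* stashed original length */
		size_t s_iova_len = s->length;             /* padded length used for the IOVA */

		/* put the CPU view back exactly as the caller passed it in */
		s->offset = s_offset;
		s->length = s_length;
		/* the device view: this segment's slice of the one allocation */
		sg_dma_address(s) = iova + iova_offset(iovad, s_offset);
		sg_dma_len(s) = s_length;
		iova += s_iova_len;
	}
}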
> Here is the code in question from > drivers/media/v4l2-core/videobuf2-dma-contig.c : > > static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt) > { > struct scatterlist *s; > dma_addr_t expected = sg_dma_address(sgt->sgl); > unsigned int i; > unsigned long size = 0; > > for_each_sg(sgt->sgl, s, sgt->nents, i) { > if (sg_dma_address(s) != expected) > break; > expected = sg_dma_address(s) + sg_dma_len(s); > size += sg_dma_len(s); > } > return size; > } > > > static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr, > unsigned long size, enum dma_data_direction dma_dir) > { > struct vb2_dc_conf *conf = alloc_ctx; > struct vb2_dc_buf *buf; > struct frame_vector *vec; > unsigned long offset; > int n_pages, i; > int ret = 0; > struct sg_table *sgt; > unsigned long contig_size; > unsigned long dma_align = dma_get_cache_alignment(); > DEFINE_DMA_ATTRS(attrs); > > dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs); > > buf = kzalloc(sizeof *buf, GFP_KERNEL); > buf->dma_dir = dma_dir; > > offset = vaddr & ~PAGE_MASK; > vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE); > buf->vec = vec; > n_pages = frame_vector_count(vec); > > sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); > > ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages, > offset, size, GFP_KERNEL); > > sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents, > buf->dma_dir, &attrs); > > contig_size = vb2_dc_get_contiguous_size(sgt); (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here) > if (contig_size < size) { > > <<<=== if the original buffer had sg entries that were not > aligned on the "natural" alignment for their size, the new arm64 iommu > core code inserts a 'gap' in the iommu mapping, which causes > vb2_dc_get_contiguous_size() to exit early (and return a smaller size > than expected). > > pr_err("contiguous mapping is too small %lu/%lu\n", > contig_size, size); > ret = -EFAULT; > goto fail_map_sg; > } > > > So, is the videobuf2-dma-contig.c based on an incorrect assumption > about how the DMA API is supposed to work? > Is it even possible to map a "contiguous-in-iova-range" mapping for a > buffer given as an sg_table with an arbitrary set of pages? From the Streaming DMA mappings section of Documentation/DMA-API.txt: Note also that the above constraints on physical contiguity and dma_mask may not apply if the platform has an IOMMU (a device which maps an I/O DMA address to a physical memory address). However, to be portable, device driver writers may *not* assume that such an IOMMU exists. There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system. However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements. Robin. [0]:http://article.gmane.org/gmane.linux.kernel.iommu/11185 > > Thanks for helping to move this forward. > > -Dan > ^ permalink raw reply [flat|nested] 15+ messages in thread
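As a point of comparison for the "right tool" remark above, tight control of IOVA allocation through the raw IOMMU API would look roughly like the sketch below: the caller owns the domain and chooses the IOVA itself, so device-side contiguity holds by construction. Everything here is illustrative - the fixed base address is made up, IOVA allocation, attach/detach handling and CPU cache maintenance are omitted, and the segments are assumed to already be IOMMU-page-aligned (which is exactly the problem iommu_dma_map_sg() exists to solve):

#include <linux/iommu.h>
#include <linux/scatterlist.h>

/*
 * Illustrative only: map a scatterlist contiguously at a caller-chosen
 * IOVA using the IOMMU API directly.
 */
static dma_addr_t map_contig_with_iommu_api(struct iommu_domain *domain,
					    struct sg_table *sgt, size_t size)
{
	const dma_addr_t iova = 0x10000000;	/* made-up fixed base */
	size_t mapped;

	mapped = iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents,
			      IOMMU_READ | IOMMU_WRITE);
	if (mapped < size) {
		if (mapped)
			iommu_unmap(domain, iova, mapped);
		return 0;
	}
	return iova;
}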
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-10-30 14:27 ` Robin Murphy @ 2015-11-02 13:11 ` Daniel Kurtz 2015-11-02 13:43 ` Tomasz Figa 0 siblings, 1 reply; 15+ messages in thread From: Daniel Kurtz @ 2015-11-02 13:11 UTC (permalink / raw) To: Robin Murphy Cc: Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, laurent.pinchart+renesas, Joerg Roedel, thunder.leizhen, Catalin Marinas, Tomasz Figa, Russell King, linux-mediatek +Tomasz, so he can reply to the thread +Marek and Russell as recommended by Tomasz On Oct 30, 2015 22:27, "Robin Murphy" <robin.murphy@arm.com> wrote: > > Hi Dan, > > On 30/10/15 01:17, Daniel Kurtz wrote: >> >> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the >> v4l2-contig's usage of the DMA API. >> >> Hi Robin, >> >> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote: >>> >>> On 26/10/15 13:44, Yong Wu wrote: >>>> >>>> >>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote: >>>> [...] >>>>> >>>>> >>>>> +/* >>>>> + * The DMA API client is passing in a scatterlist which could describe >>>>> + * any old buffer layout, but the IOMMU API requires everything to be >>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of >>>>> + * impedance-matching, to be able to hand off a suitably-aligned list, >>>>> + * but still preserve the original offsets and sizes for the caller. >>>>> + */ >>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, >>>>> + int nents, int prot) >>>>> +{ >>>>> + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); >>>>> + struct iova_domain *iovad = domain->iova_cookie; >>>>> + struct iova *iova; >>>>> + struct scatterlist *s, *prev = NULL; >>>>> + dma_addr_t dma_addr; >>>>> + size_t iova_len = 0; >>>>> + int i; >>>>> + >>>>> + /* >>>>> + * Work out how much IOVA space we need, and align the segments >>>>> to >>>>> + * IOVA granules for the IOMMU driver to handle. With some clever >>>>> + * trickery we can modify the list in-place, but reversibly, by >>>>> + * hiding the original data in the as-yet-unused DMA fields. >>>>> + */ >>>>> + for_each_sg(sg, s, nents, i) { >>>>> + size_t s_offset = iova_offset(iovad, s->offset); >>>>> + size_t s_length = s->length; >>>>> + >>>>> + sg_dma_address(s) = s->offset; >>>>> + sg_dma_len(s) = s_length; >>>>> + s->offset -= s_offset; >>>>> + s_length = iova_align(iovad, s_length + s_offset); >>>>> + s->length = s_length; >>>>> + >>>>> + /* >>>>> + * The simple way to avoid the rare case of a segment >>>>> + * crossing the boundary mask is to pad the previous one >>>>> + * to end at a naturally-aligned IOVA for this one's >>>>> size, >>>>> + * at the cost of potentially over-allocating a little. >>>>> + */ >>>>> + if (prev) { >>>>> + size_t pad_len = roundup_pow_of_two(s_length); >>>>> + >>>>> + pad_len = (pad_len - iova_len) & (pad_len - 1); >>>>> + prev->length += pad_len; >>>> >>>> >>>> >>>> Hi Robin, >>>> While our v4l2 testing, It seems that we met a problem here. >>>> Here we update prev->length again, Do we need update >>>> sg_dma_len(prev) again too? >>>> >>>> Some function like vb2_dc_get_contiguous_size[1] always get >>>> sg_dma_len(s) to compare instead of s->length. so it may break >>>> unexpectedly while sg_dma_len(s) is not same with s->length. 
>>> >>> >>> >>> This is just tweaking the faked-up length that we hand off to iommu_map_sg() >>> (see also the iova_align() above), to trick it into bumping this segment up >>> to a suitable starting IOVA. The real length at this point is stashed in >>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so >>> both will hold the same true length once we return to the caller. >>> >>> Yes, it does mean that if you have a list where the segment lengths are page >>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll >>> still end up with a gap between the second and third segments, but that's >>> fine because the DMA API offers no guarantees about what the resulting DMA >>> addresses will be (consider the no-IOMMU case where they would each just be >>> "mapped" to their physical address). If that breaks v4l, then it's probably >>> v4l's DMA API use that needs looking at (again). >> >> >> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of >> memory pages into a contiguous block in device memory address space. >> This would allow passing a dma mapped buffer to device dma using just >> a device address and length. > > > Not at all. The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail). > > Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned). > >> IIUC, the change above breaks this model by inserting gaps in how the >> buffer is mapped to device memory, such that the buffer is no longer >> contiguous in dma address space. > > > Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0]. 
> >> Here is the code in question from >> drivers/media/v4l2-core/videobuf2-dma-contig.c : >> >> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt) >> { >> struct scatterlist *s; >> dma_addr_t expected = sg_dma_address(sgt->sgl); >> unsigned int i; >> unsigned long size = 0; >> >> for_each_sg(sgt->sgl, s, sgt->nents, i) { >> if (sg_dma_address(s) != expected) >> break; >> expected = sg_dma_address(s) + sg_dma_len(s); >> size += sg_dma_len(s); >> } >> return size; >> } >> >> >> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr, >> unsigned long size, enum dma_data_direction dma_dir) >> { >> struct vb2_dc_conf *conf = alloc_ctx; >> struct vb2_dc_buf *buf; >> struct frame_vector *vec; >> unsigned long offset; >> int n_pages, i; >> int ret = 0; >> struct sg_table *sgt; >> unsigned long contig_size; >> unsigned long dma_align = dma_get_cache_alignment(); >> DEFINE_DMA_ATTRS(attrs); >> >> dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs); >> >> buf = kzalloc(sizeof *buf, GFP_KERNEL); >> buf->dma_dir = dma_dir; >> >> offset = vaddr & ~PAGE_MASK; >> vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE); >> buf->vec = vec; >> n_pages = frame_vector_count(vec); >> >> sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); >> >> ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages, >> offset, size, GFP_KERNEL); >> >> sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents, >> buf->dma_dir, &attrs); >> >> contig_size = vb2_dc_get_contiguous_size(sgt); > > > (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here) > >> if (contig_size < size) { >> >> <<<=== if the original buffer had sg entries that were not >> aligned on the "natural" alignment for their size, the new arm64 iommu >> core code inserts a 'gap' in the iommu mapping, which causes >> vb2_dc_get_contiguous_size() to exit early (and return a smaller size >> than expected). >> >> pr_err("contiguous mapping is too small %lu/%lu\n", >> contig_size, size); >> ret = -EFAULT; >> goto fail_map_sg; >> } >> >> >> So, is the videobuf2-dma-contig.c based on an incorrect assumption >> about how the DMA API is supposed to work? >> Is it even possible to map a "contiguous-in-iova-range" mapping for a >> buffer given as an sg_table with an arbitrary set of pages? > > > From the Streaming DMA mappings section of Documentation/DMA-API.txt: > > Note also that the above constraints on physical contiguity and > dma_mask may not apply if the platform has an IOMMU (a device which > maps an I/O DMA address to a physical memory address). However, to be > portable, device driver writers may *not* assume that such an IOMMU > exists. > > There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system. However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements. > > Robin. 
> > [0]:http://article.gmane.org/gmane.linux.kernel.iommu/11185 > >> >> Thanks for helping to move this forward. >> >> -Dan >> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-02 13:11 ` Daniel Kurtz @ 2015-11-02 13:43 ` Tomasz Figa 2015-11-03 17:41 ` Robin Murphy 0 siblings, 1 reply; 15+ messages in thread From: Tomasz Figa @ 2015-11-02 13:43 UTC (permalink / raw) To: Daniel Kurtz Cc: Robin Murphy, Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, Russell King, linux-mediatek On Mon, Nov 2, 2015 at 10:11 PM, Daniel Kurtz <djkurtz@chromium.org> wrote: > > +Tomasz, so he can reply to the thread > +Marek and Russell as recommended by Tomasz > > On Oct 30, 2015 22:27, "Robin Murphy" <robin.murphy@arm.com> wrote: > > > > Hi Dan, > > > > On 30/10/15 01:17, Daniel Kurtz wrote: > >> > >> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the > >> v4l2-contig's usage of the DMA API. > >> > >> Hi Robin, > >> > >> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote: > >>> > >>> On 26/10/15 13:44, Yong Wu wrote: > >>>> > >>>> > >>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote: > >>>> [...] > >>>>> > >>>>> > >>>>> +/* > >>>>> + * The DMA API client is passing in a scatterlist which could describe > >>>>> + * any old buffer layout, but the IOMMU API requires everything to be > >>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of > >>>>> + * impedance-matching, to be able to hand off a suitably-aligned list, > >>>>> + * but still preserve the original offsets and sizes for the caller. > >>>>> + */ > >>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, > >>>>> + int nents, int prot) > >>>>> +{ > >>>>> + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); > >>>>> + struct iova_domain *iovad = domain->iova_cookie; > >>>>> + struct iova *iova; > >>>>> + struct scatterlist *s, *prev = NULL; > >>>>> + dma_addr_t dma_addr; > >>>>> + size_t iova_len = 0; > >>>>> + int i; > >>>>> + > >>>>> + /* > >>>>> + * Work out how much IOVA space we need, and align the segments > >>>>> to > >>>>> + * IOVA granules for the IOMMU driver to handle. With some clever > >>>>> + * trickery we can modify the list in-place, but reversibly, by > >>>>> + * hiding the original data in the as-yet-unused DMA fields. > >>>>> + */ > >>>>> + for_each_sg(sg, s, nents, i) { > >>>>> + size_t s_offset = iova_offset(iovad, s->offset); > >>>>> + size_t s_length = s->length; > >>>>> + > >>>>> + sg_dma_address(s) = s->offset; > >>>>> + sg_dma_len(s) = s_length; > >>>>> + s->offset -= s_offset; > >>>>> + s_length = iova_align(iovad, s_length + s_offset); > >>>>> + s->length = s_length; > >>>>> + > >>>>> + /* > >>>>> + * The simple way to avoid the rare case of a segment > >>>>> + * crossing the boundary mask is to pad the previous one > >>>>> + * to end at a naturally-aligned IOVA for this one's > >>>>> size, > >>>>> + * at the cost of potentially over-allocating a little. I'd like to know what is the boundary mask and what hardware imposes requirements like this. The cost here is not only over-allocating a little, but making many, many buffers contiguously mappable on the CPU, unmappable contiguously in IOMMU, which just defeats the purpose of having an IOMMU, which I believe should be there for simple IP blocks taking one DMA address to be able to view the buffer the same way as the CPU. 
> >>>>> + */ > >>>>> + if (prev) { > >>>>> + size_t pad_len = roundup_pow_of_two(s_length); > >>>>> + > >>>>> + pad_len = (pad_len - iova_len) & (pad_len - 1); > >>>>> + prev->length += pad_len; > >>>> > >>>> > >>>> > >>>> Hi Robin, > >>>> While our v4l2 testing, It seems that we met a problem here. > >>>> Here we update prev->length again, Do we need update > >>>> sg_dma_len(prev) again too? > >>>> > >>>> Some function like vb2_dc_get_contiguous_size[1] always get > >>>> sg_dma_len(s) to compare instead of s->length. so it may break > >>>> unexpectedly while sg_dma_len(s) is not same with s->length. > >>> > >>> > >>> > >>> This is just tweaking the faked-up length that we hand off to iommu_map_sg() > >>> (see also the iova_align() above), to trick it into bumping this segment up > >>> to a suitable starting IOVA. The real length at this point is stashed in > >>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so > >>> both will hold the same true length once we return to the caller. > >>> > >>> Yes, it does mean that if you have a list where the segment lengths are page > >>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll > >>> still end up with a gap between the second and third segments, but that's > >>> fine because the DMA API offers no guarantees about what the resulting DMA > >>> addresses will be (consider the no-IOMMU case where they would each just be > >>> "mapped" to their physical address). If that breaks v4l, then it's probably > >>> v4l's DMA API use that needs looking at (again). > >> > >> > >> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of > >> memory pages into a contiguous block in device memory address space. > >> This would allow passing a dma mapped buffer to device dma using just > >> a device address and length. > > > > > > Not at all. The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail). Agreed. The dma_map_*() API is not guaranteed to return a single contiguous part of virtual address space for any given SG list. However it was understood to be able to map buffers contiguously mappable by the CPU into a single segment and users, videobuf2-dma-contig in particular, relied on this. > > > > Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned). And this is fine for vb2-dma-contig, which was made for devices that require buffers contiguous in its address space. Without IOMMU it will allow only physically contiguous buffers and fails otherwise, which is fine, because it's a hardware requirement. > > > >> IIUC, the change above breaks this model by inserting gaps in how the > >> buffer is mapped to device memory, such that the buffer is no longer > >> contiguous in dma address space. 
> > > > > > Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0]. Could you explain segment length/boundary limits and when buffers can reach them? Sorry, i haven't been following all the discussions, but I'm not aware of any similar requirements of the IOMMU hardware I worked with. > > > >> Here is the code in question from > >> drivers/media/v4l2-core/videobuf2-dma-contig.c : > >> > >> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt) > >> { > >> struct scatterlist *s; > >> dma_addr_t expected = sg_dma_address(sgt->sgl); > >> unsigned int i; > >> unsigned long size = 0; > >> > >> for_each_sg(sgt->sgl, s, sgt->nents, i) { > >> if (sg_dma_address(s) != expected) > >> break; > >> expected = sg_dma_address(s) + sg_dma_len(s); > >> size += sg_dma_len(s); > >> } > >> return size; > >> } > >> > >> > >> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr, > >> unsigned long size, enum dma_data_direction dma_dir) > >> { > >> struct vb2_dc_conf *conf = alloc_ctx; > >> struct vb2_dc_buf *buf; > >> struct frame_vector *vec; > >> unsigned long offset; > >> int n_pages, i; > >> int ret = 0; > >> struct sg_table *sgt; > >> unsigned long contig_size; > >> unsigned long dma_align = dma_get_cache_alignment(); > >> DEFINE_DMA_ATTRS(attrs); > >> > >> dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs); > >> > >> buf = kzalloc(sizeof *buf, GFP_KERNEL); > >> buf->dma_dir = dma_dir; > >> > >> offset = vaddr & ~PAGE_MASK; > >> vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE); > >> buf->vec = vec; > >> n_pages = frame_vector_count(vec); > >> > >> sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); > >> > >> ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages, > >> offset, size, GFP_KERNEL); > >> > >> sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents, > >> buf->dma_dir, &attrs); > >> > >> contig_size = vb2_dc_get_contiguous_size(sgt); > > > > > > (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here) I'm not sure what you mean, please elaborate. The code considers only the case of contiguously mapping at least the requested size as a success, because anything else is useless with the hardware. > > > >> if (contig_size < size) { > >> > >> <<<=== if the original buffer had sg entries that were not > >> aligned on the "natural" alignment for their size, the new arm64 iommu > >> core code inserts a 'gap' in the iommu mapping, which causes > >> vb2_dc_get_contiguous_size() to exit early (and return a smaller size > >> than expected). > >> > >> pr_err("contiguous mapping is too small %lu/%lu\n", > >> contig_size, size); > >> ret = -EFAULT; > >> goto fail_map_sg; > >> } > >> > >> > >> So, is the videobuf2-dma-contig.c based on an incorrect assumption > >> about how the DMA API is supposed to work? > >> Is it even possible to map a "contiguous-in-iova-range" mapping for a > >> buffer given as an sg_table with an arbitrary set of pages? > > > > > > From the Streaming DMA mappings section of Documentation/DMA-API.txt: > > > > Note also that the above constraints on physical contiguity and > > dma_mask may not apply if the platform has an IOMMU (a device which > > maps an I/O DMA address to a physical memory address). 
However, to be > > portable, device driver writers may *not* assume that such an IOMMU > > exists. > > > > There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system. Could you please elaborate? I'd like to see examples, because I can't really imagine buffers mappable contiguously on CPU, but not on IOMMU. Also, as I said, the hardware I worked with didn't suffer from problems like this. > > However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements. The DMA API is actually the only good tool to use here to keep the videobuf2-dma-contig code away from the knowledge about platform specific data, e.g. presence of IOMMU. The only thing it knows is that the target hardware requires a single contiguous buffer and it relies on the fact that in correct cases the buffer given to it will meet this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable with IOMMU). Best regards, Tomasz ^ permalink raw reply [flat|nested] 15+ messages in thread
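The reliance Tomasz describes reduces to the pattern below (a paraphrase of vb2_dc_get_userptr(), not the exact driver code): once the contiguity check passes, only the first entry's DMA address and the total size are ever handed to the hardware.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Paraphrase of the vb2_dc_get_userptr() pattern quoted earlier (not the
 * exact driver code): after the contiguity check, the buffer is used as
 * one flat (address, size) region.
 */
static int use_as_flat_buffer(struct device *dev, struct sg_table *sgt,
			      unsigned long size, enum dma_data_direction dir,
			      dma_addr_t *dma_addr)
{
	unsigned long contig;

	sgt->nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, dir);
	if (!sgt->nents)
		return -EIO;

	contig = vb2_dc_get_contiguous_size(sgt);	/* as quoted above */
	if (contig < size) {
		dma_unmap_sg(dev, sgt->sgl, sgt->orig_nents, dir);
		return -EFAULT;
	}

	/* everything downstream sees a single device address and length */
	*dma_addr = sg_dma_address(sgt->sgl);
	return 0;
}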
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-02 13:43 ` Tomasz Figa @ 2015-11-03 17:41 ` Robin Murphy 2015-11-03 18:40 ` Russell King - ARM Linux 2015-11-04 5:12 ` Tomasz Figa 0 siblings, 2 replies; 15+ messages in thread From: Robin Murphy @ 2015-11-03 17:41 UTC (permalink / raw) To: Tomasz Figa, Daniel Kurtz Cc: Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, Russell King, linux-mediatek Hi Tomasz, On 02/11/15 13:43, Tomasz Figa wrote: > I'd like to know what is the boundary mask and what hardware imposes > requirements like this. The cost here is not only over-allocating a > little, but making many, many buffers contiguously mappable on the > CPU, unmappable contiguously in IOMMU, which just defeats the purpose > of having an IOMMU, which I believe should be there for simple IP > blocks taking one DMA address to be able to view the buffer the same > way as the CPU. The expectation with dma_map_sg() is that you're either going to be iterating over the buffer segments, handing off each address to the device to process one by one; or you have a scatter-gather-capable device, in which case you hand off the whole list at once. It's in the latter case where you have to make sure the list doesn't exceed the hardware limitations of that device. I believe the original concern was disk controllers (the introduction of dma_parms seems to originate from the linux-scsi list), but most scatter-gather engines are going to have some limit on how much they can handle per entry (IMO the dmaengine drivers are the easiest example to look at). Segment boundaries are a little more arcane, but my assumption is that they relate to the kind of devices whose addressing is not flat but relative to some separate segment register (The "64-bit" mode of USB EHCI is one concrete example I can think of) - since you cannot realistically change the segment register while the device is in the middle of accessing a single buffer entry, that entry must not fall across a segment boundary or at some point the device's accesses are going to overflow the offset address bits and wrap around to bogus addresses at the bottom of the segment. Now yes, it will be possible under _most_ circumstances to use an IOMMU to lay out a list of segments with page-aligned lengths within a single IOVA allocation whilst still meeting all the necessary constraints. It just needs some unavoidably complicated calculations - quite likely significantly more complex than my v5 version of map_sg() that tried to do that and merge segments but failed to take the initial alignment into account properly - since there are much simpler ways to enforce just the _necessary_ behaviour for the DMA API, I put the complicated stuff to one side for now to prevent it holding up getting the basic functional support in place. >>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of >>>> memory pages into a contiguous block in device memory address space. >>>> This would allow passing a dma mapped buffer to device dma using just >>>> a device address and length. >>> >>> >>> Not at all. 
The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail). > > Agreed. The dma_map_*() API is not guaranteed to return a single > contiguous part of virtual address space for any given SG list. > However it was understood to be able to map buffers contiguously > mappable by the CPU into a single segment and users, > videobuf2-dma-contig in particular, relied on this. I don't follow that - _any_ buffer made of page-sized chunks is going to be mappable contiguously by the CPU; it's clearly impossible for the streaming DMA API itself to offer such a guarantee, because it's entirely orthogonal to the presence or otherwise of an IOMMU. Furthermore, I can't see any existing dma_map_sg implementation (between arm/64 and x86, at least), that _won't_ break that expectation under certain conditions (ranging from "relatively pathological" to "always"), so it still seems questionable to have a dependency on it. >>> Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned). > > And this is fine for vb2-dma-contig, which was made for devices that > require buffers contiguous in its address space. Without IOMMU it will > allow only physically contiguous buffers and fails otherwise, which is > fine, because it's a hardware requirement. If it depends on having contiguous-from-the-device's-view DMA buffers either way, that's a sign it should perhaps be using the coherent DMA API instead, which _does_ give such a guarantee. I'm well aware of the "but the noncacheable mappings make userspace access unacceptably slow!" issue many folks have with that, though, and don't particularly fancy going off on that tangent here. >>>> IIUC, the change above breaks this model by inserting gaps in how the >>>> buffer is mapped to device memory, such that the buffer is no longer >>>> contiguous in dma address space. >>> >>> >>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0]. > > Could you explain segment length/boundary limits and when buffers can > reach them? Sorry, i haven't been following all the discussions, but > I'm not aware of any similar requirements of the IOMMU hardware I > worked with. I hope the explanation at the top makes sense - it's purely about the requirements of the DMA master device itself, nothing to do with the IOMMU (or lack of) in the middle. Devices with scatter-gather DMA limitations exist, therefore the API for scatter-gather DMA is designed to represent and respect such limitations. >>>> Here is the code in question from >>>> drivers/media/v4l2-core/videobuf2-dma-contig.c : [...] 
>>>> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr, >>>> unsigned long size, enum dma_data_direction dma_dir) >>>> { [...] >>>> sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents, >>>> buf->dma_dir, &attrs); >>>> >>>> contig_size = vb2_dc_get_contiguous_size(sgt); >>> >>> >>> (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here) > > I'm not sure what you mean, please elaborate. The code considers only > the case of contiguously mapping at least the requested size as a > success, because anything else is useless with the hardware. My bad; having now compared against the actual file I see this is just a cherry-picking of relevant lines with all the error checking stripped out. Objection withdrawn ;) >>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption >>>> about how the DMA API is supposed to work? >>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a >>>> buffer given as an sg_table with an arbitrary set of pages? >>> >>> >>> From the Streaming DMA mappings section of Documentation/DMA-API.txt: >>> >>> Note also that the above constraints on physical contiguity and >>> dma_mask may not apply if the platform has an IOMMU (a device which >>> maps an I/O DMA address to a physical memory address). However, to be >>> portable, device driver writers may *not* assume that such an IOMMU >>> exists. >>> >>> There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system. > > Could you please elaborate? I'd like to see examples, because I can't > really imagine buffers mappable contiguously on CPU, but not on IOMMU. > Also, as I said, the hardware I worked with didn't suffer from > problems like this. "...device driver writers may *not* assume that such an IOMMU exists." >>> However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements. > > The DMA API is actually the only good tool to use here to keep the > videobuf2-dma-contig code away from the knowledge about platform > specific data, e.g. presence of IOMMU. The only thing it knows is that > the target hardware requires a single contiguous buffer and it relies > on the fact that in correct cases the buffer given to it will meet > this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable > with IOMMU). As above; the DMA API guarantees only what the DMA API guarantees. An IOMMU-based implementation of streaming DMA is free to identity-map pages if it only cares about device isolation; a non-IOMMU implementation is free to provide streaming DMA remapping via some elaborate bounce-buffering scheme if it really wants to. GART-type IOMMUs... let's not even go there. If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to use dma_alloc_coherent() for that, not streaming mappings. Robin. ^ permalink raw reply [flat|nested] 15+ messages in thread
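For reference, the guarantee being pointed at: the coherent API returns a buffer that is device-contiguous by definition, IOMMU or not, at the cost of a (typically) non-cacheable kernel mapping. A minimal sketch:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Minimal sketch of the coherent-API alternative: 'size' bytes that are
 * guaranteed contiguous from the device's point of view, IOMMU or not.
 */
static void *alloc_device_contiguous(struct device *dev, size_t size,
				     dma_addr_t *dma_handle)
{
	void *cpu_addr = dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);

	if (!cpu_addr)
		return NULL;

	/* program the hardware with *dma_handle and size; free later with
	 * dma_free_coherent(dev, size, cpu_addr, *dma_handle) */
	return cpu_addr;
}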
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-03 17:41 ` Robin Murphy @ 2015-11-03 18:40 ` Russell King - ARM Linux 2015-11-04 5:15 ` Tomasz Figa 2015-11-04 5:12 ` Tomasz Figa 1 sibling, 1 reply; 15+ messages in thread From: Russell King - ARM Linux @ 2015-11-03 18:40 UTC (permalink / raw) To: Robin Murphy Cc: Tomasz Figa, Daniel Kurtz, Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, linux-mediatek On Tue, Nov 03, 2015 at 05:41:24PM +0000, Robin Murphy wrote: > Hi Tomasz, > > On 02/11/15 13:43, Tomasz Figa wrote: > >Agreed. The dma_map_*() API is not guaranteed to return a single > >contiguous part of virtual address space for any given SG list. > >However it was understood to be able to map buffers contiguously > >mappable by the CPU into a single segment and users, > >videobuf2-dma-contig in particular, relied on this. > > I don't follow that - _any_ buffer made of page-sized chunks is going to be > mappable contiguously by the CPU; it's clearly impossible for the streaming > DMA API itself to offer such a guarantee, because it's entirely orthogonal > to the presence or otherwise of an IOMMU. Tomasz's use of "virtual address space" above in combination with the DMA API is really confusing. dma_map_sg() does *not* construct a CPU view of the passed scatterlist. The only thing dma_map_sg() might do with virtual addresses is to use them as a way to achieve cache coherence for one particular view of that memory, that being the kernel's own lowmem mapping and any kmaps. It doesn't extend to vmalloc() or userspace mappings of the memory. If the scatterlist is converted to an array of struct page pointers, it's possible to map it with vmap(), but it's implementation defined whether such a mapping will receive cache maintanence as part of the DMA API or not. (If you have PIPT caches, it will, if they're VIPT caches, maybe not.) There is a separate set of calls to deal with the flushing issues for vmap()'d memory in this case - see flush_kernel_vmap_range() and invalidate_kernel_vmap_range(). -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 15+ messages in thread
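To illustrate Russell's point about driver-built mappings: the streaming DMA API maintains only the kernel lowmem/kmap view, so a driver that creates its own contiguous CPU view with vmap() must bracket device I/O with the dedicated helpers itself. A sketch, assuming a DMA_FROM_DEVICE transfer into pages that the CPU then reads through the vmap() alias:

#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <linux/mm.h>

/*
 * Sketch: CPU access to device-written pages through a driver-built
 * vmap() alias.  The streaming DMA API's cache maintenance does not
 * cover this mapping, so it is handled explicitly here.
 */
static void *map_and_read_back(struct page **pages, unsigned int n_pages,
			       int size)
{
	void *vaddr = vmap(pages, n_pages, VM_MAP, PAGE_KERNEL);

	if (!vaddr)
		return NULL;

	/* after DMA from the device completes, before the CPU reads: */
	invalidate_kernel_vmap_range(vaddr, size);

	/* ... CPU reads through vaddr ... */

	/* had the CPU written through vaddr before DMA *to* the device,
	 * flush_kernel_vmap_range(vaddr, size) would be the one to call */
	return vaddr;	/* caller vunmap()s when finished */
}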
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-03 18:40 ` Russell King - ARM Linux @ 2015-11-04 5:15 ` Tomasz Figa 2015-11-04 9:10 ` Russell King - ARM Linux 0 siblings, 1 reply; 15+ messages in thread From: Tomasz Figa @ 2015-11-04 5:15 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Robin Murphy, Daniel Kurtz, Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, linux-mediatek On Wed, Nov 4, 2015 at 3:40 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Tue, Nov 03, 2015 at 05:41:24PM +0000, Robin Murphy wrote: >> Hi Tomasz, >> >> On 02/11/15 13:43, Tomasz Figa wrote: >> >Agreed. The dma_map_*() API is not guaranteed to return a single >> >contiguous part of virtual address space for any given SG list. >> >However it was understood to be able to map buffers contiguously >> >mappable by the CPU into a single segment and users, >> >videobuf2-dma-contig in particular, relied on this. >> >> I don't follow that - _any_ buffer made of page-sized chunks is going to be >> mappable contiguously by the CPU; it's clearly impossible for the streaming >> DMA API itself to offer such a guarantee, because it's entirely orthogonal >> to the presence or otherwise of an IOMMU. > > Tomasz's use of "virtual address space" above in combination with the > DMA API is really confusing. I suppose I must have mistakenly use "virtual address space" somewhere instead of "IO virtual address space". I'm sorry for causing confusion. The thing being discussed here is mapping of buffers described by scatterlists into IO virtual address space, i.e. the operation happening when dma_map_sg() is called for an IOMMU-enabled device. Best regards, Tomasz ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-04 5:15 ` Tomasz Figa @ 2015-11-04 9:10 ` Russell King - ARM Linux 0 siblings, 0 replies; 15+ messages in thread From: Russell King - ARM Linux @ 2015-11-04 9:10 UTC (permalink / raw) To: Tomasz Figa Cc: Robin Murphy, Daniel Kurtz, Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, linux-mediatek On Wed, Nov 04, 2015 at 02:15:41PM +0900, Tomasz Figa wrote: > On Wed, Nov 4, 2015 at 3:40 AM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Tue, Nov 03, 2015 at 05:41:24PM +0000, Robin Murphy wrote: > >> Hi Tomasz, > >> > >> On 02/11/15 13:43, Tomasz Figa wrote: > >> >Agreed. The dma_map_*() API is not guaranteed to return a single > >> >contiguous part of virtual address space for any given SG list. > >> >However it was understood to be able to map buffers contiguously > >> >mappable by the CPU into a single segment and users, > >> >videobuf2-dma-contig in particular, relied on this. > >> > >> I don't follow that - _any_ buffer made of page-sized chunks is going to be > >> mappable contiguously by the CPU; it's clearly impossible for the streaming > >> DMA API itself to offer such a guarantee, because it's entirely orthogonal > >> to the presence or otherwise of an IOMMU. > > > > Tomasz's use of "virtual address space" above in combination with the > > DMA API is really confusing. > > I suppose I must have mistakenly use "virtual address space" somewhere > instead of "IO virtual address space". I'm sorry for causing > confusion. > > The thing being discussed here is mapping of buffers described by > scatterlists into IO virtual address space, i.e. the operation > happening when dma_map_sg() is called for an IOMMU-enabled device. ... and there, it's perfectly legal for an IOMMU to merge all entries in a scatterlist into one mapping - so dma_map_sg() would return 1. What that means is that the scatterlist contains the original number of entries which describes the CPU view of the buffer list using the original number of entries, and the DMA device view of the same but using just the first entry. In other words, if you're walking a scatterlist, and doing a mixture of DMA and PIO, you can't assume that if you're at scatterlist entry N for DMA, you can switch to PIO for entry N and you'll write to the same memory. (I know that there's badly written drivers in the kernel which unfortunately do make this assumption, and if they're used in the presence of an IOMMU, they _will_ be silently data corrupting.) -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 15+ messages in thread
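The pitfall Russell describes comes down to walking each view of the list with the right count: the CPU-side fields remain valid for all orig_nents entries, the DMA fields only for the count dma_map_sg() returned, and entry N of one view need not correspond to entry N of the other once entries have been merged. A sketch of keeping the two walks separate; hw_start_dma() and do_pio_on() are hypothetical driver hooks:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* hypothetical driver hooks, standing in for real hardware access */
extern void hw_start_dma(dma_addr_t addr, unsigned int len);
extern void do_pio_on(void *vaddr, unsigned int len);

static int transfer(struct device *dev, struct scatterlist *sgl,
		    int orig_nents, enum dma_data_direction dir, bool use_pio)
{
	struct scatterlist *s;
	int i, count;

	if (use_pio) {
		/* CPU view: always walk the original entries */
		for_each_sg(sgl, s, orig_nents, i)
			do_pio_on(sg_virt(s), s->length);
		return 0;
	}

	count = dma_map_sg(dev, sgl, orig_nents, dir);
	if (!count)
		return -EIO;

	/*
	 * Device view: walk only the 'count' entries dma_map_sg() returned,
	 * and never switch to PIO mid-list reusing the same index - after
	 * merging, entry N here is not entry N of the CPU view.
	 */
	for_each_sg(sgl, s, count, i)
		hw_start_dma(sg_dma_address(s), sg_dma_len(s));

	/* (completion wait omitted); unmap with the original nents */
	dma_unmap_sg(dev, sgl, orig_nents, dir);
	return 0;
}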
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-03 17:41 ` Robin Murphy 2015-11-03 18:40 ` Russell King - ARM Linux @ 2015-11-04 5:12 ` Tomasz Figa 2015-11-04 9:27 ` Russell King - ARM Linux 2015-11-09 13:11 ` Robin Murphy 1 sibling, 2 replies; 15+ messages in thread From: Tomasz Figa @ 2015-11-04 5:12 UTC (permalink / raw) To: Robin Murphy Cc: Daniel Kurtz, Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, Russell King, linux-mediatek On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy <robin.murphy@arm.com> wrote: > Hi Tomasz, > > On 02/11/15 13:43, Tomasz Figa wrote: >> >> I'd like to know what is the boundary mask and what hardware imposes >> requirements like this. The cost here is not only over-allocating a >> little, but making many, many buffers contiguously mappable on the >> CPU, unmappable contiguously in IOMMU, which just defeats the purpose >> of having an IOMMU, which I believe should be there for simple IP >> blocks taking one DMA address to be able to view the buffer the same >> way as the CPU. > > > The expectation with dma_map_sg() is that you're either going to be > iterating over the buffer segments, handing off each address to the device > to process one by one; My understanding of a scatterlist was that it represents a buffer as a whole, by joining together its physically discontinuous segments. I don't see how single segments (layout of which is completely up to the allocator; often just single pages) would be usable for hardware that needs to do some work more serious than just writing a byte stream continuously to subsequent buffers. In case of such simple devices you don't even need an IOMMU (for means other than protection and/or getting over address space limitations). However, IMHO the most important use case of an IOMMU is to make buffers, which are contiguous in CPU virtual address space (VA), contiguous in device's address space (IOVA). Your implementation of dma_map_sg() effectively breaks this ability, so I'm not really following why it's located under drivers/iommu and supposed to be used with IOMMU-enabled platforms... > or you have a scatter-gather-capable device, in which > case you hand off the whole list at once. No need for mapping ability of the IOMMU here as well (except for working around address space issues, as I mentioned above). > It's in the latter case where you > have to make sure the list doesn't exceed the hardware limitations of that > device. I believe the original concern was disk controllers (the > introduction of dma_parms seems to originate from the linux-scsi list), but > most scatter-gather engines are going to have some limit on how much they > can handle per entry (IMO the dmaengine drivers are the easiest example to > look at). 
> > Segment boundaries are a little more arcane, but my assumption is that they > relate to the kind of devices whose addressing is not flat but relative to > some separate segment register (The "64-bit" mode of USB EHCI is one > concrete example I can think of) - since you cannot realistically change the > segment register while the device is in the middle of accessing a single > buffer entry, that entry must not fall across a segment boundary or at some > point the device's accesses are going to overflow the offset address bits > and wrap around to bogus addresses at the bottom of the segment. The two requirements above sound like something really specific to scatter-gather-capable hardware, which as I pointed above, barely need an IOMMU (at least its mapping capabilities). We are talking here about very IOMMU-specific code, though... Now, while I see that on some systems there might be IOMMU used for improving protection and working around addressing issues with SG-capable hardware, the code shouldn't be breaking the majority of systems with IOMMU used as the only possible way to make physically discontinuous appear (IO-virtually) continuous to devices incapable of scatter-gather. > > Now yes, it will be possible under _most_ circumstances to use an IOMMU to > lay out a list of segments with page-aligned lengths within a single IOVA > allocation whilst still meeting all the necessary constraints. It just needs > some unavoidably complicated calculations - quite likely significantly more > complex than my v5 version of map_sg() that tried to do that and merge > segments but failed to take the initial alignment into account properly - > since there are much simpler ways to enforce just the _necessary_ behaviour > for the DMA API, I put the complicated stuff to one side for now to prevent > it holding up getting the basic functional support in place. Somehow just whatever currently done in arch/arm/mm/dma-mapping.c was sufficient and not overly complicated. See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 . I can see that the code there at least tries to comply with maximum segment size constraint. Segment boundary seems to be ignored, though. However, I'm convinced that in most (if not all) cases where IOMMU IOVA-contiguous mapping is needed, those two requirements don't exist. Do we really have to break the good hardware only because the bad^Wlimited one is broken? Couldn't we preserve the ARM-like behavior whenever dma_parms->segment_boundary_mask is set to all 1s and dma_parms->max_segment_size to UINT_MAX (what currently drivers used to set) or 0 (sounds more logical for the meaning of "no maximum given")? > >>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of >>>>> memory pages into a contiguous block in device memory address space. >>>>> This would allow passing a dma mapped buffer to device dma using just >>>>> a device address and length. >>>> >>>> >>>> >>>> Not at all. The streaming DMA API (dma_map_* and friends) has two >>>> responsibilities: performing any necessary cache maintenance to ensure the >>>> device will correctly see data from the CPU, and the CPU will correctly see >>>> data from the device; and working out an address for that buffer from the >>>> device's point of view to actually hand off to the hardware (which is >>>> perfectly well allowed to fail). >> >> >> Agreed. The dma_map_*() API is not guaranteed to return a single >> contiguous part of virtual address space for any given SG list. 
>> However it was understood to be able to map buffers contiguously >> mappable by the CPU into a single segment and users, >> videobuf2-dma-contig in particular, relied on this. > > > I don't follow that - _any_ buffer made of page-sized chunks is going to be > mappable contiguously by the CPU;' Yes it is. Actually the last chunk might not even need to be page-sized. However I believe we can have a scatterlist consisting of non-page-sized chunks in the middle as well, which is obviously not mappable in a contiguous way even for the CPU. > it's clearly impossible for the streaming > DMA API itself to offer such a guarantee, because it's entirely orthogonal > to the presence or otherwise of an IOMMU. But we are talking here about the very IOMMU-specific implementation of DMA API. > > Furthermore, I can't see any existing dma_map_sg implementation (between > arm/64 and x86, at least), that _won't_ break that expectation under certain > conditions (ranging from "relatively pathological" to "always"), so it still > seems questionable to have a dependency on it. The current implementation for arch/arm doesn't break that expectation. As long as we fit inside the maximum segment size (which in most, if not all, cases of the hardware that actually requires such contiguous mapping to be created, is UINT_MAX). http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 > >>>> Consider SWIOTLB's implementation - segments which already lie at >>>> physical addresses within the device's DMA mask just get passed through, >>>> while those that lie outside it get mapped into the bounce buffer, but still >>>> as individual allocations (arch code just handles cache maintenance on the >>>> resulting physical addresses and can apply any hard-wired DMA offset for the >>>> device concerned). >> >> >> And this is fine for vb2-dma-contig, which was made for devices that >> require buffers contiguous in its address space. Without IOMMU it will >> allow only physically contiguous buffers and fails otherwise, which is >> fine, because it's a hardware requirement. > > > If it depends on having contiguous-from-the-device's-view DMA buffers either > way, that's a sign it should perhaps be using the coherent DMA API instead, > which _does_ give such a guarantee. I'm well aware of the "but the > noncacheable mappings make userspace access unacceptably slow!" issue many > folks have with that, though, and don't particularly fancy going off on that > tangent here. The keywords here are DMA-BUF and user pointer. Neither of these cases can use coherent DMA API, because the buffer is already allocated, so it just needs to be mapped into another device's (or its IOMMU's) address space. Obviously we can't guarantee mappability of such buffers, e.g. in case of importing non-contiguous buffers to a device without an IOMMU, However we expect the pipelines to be sane (physically contiguous buffers or both devices IOMMU-enabled), so that such things won't happen. > >>>>> IIUC, the change above breaks this model by inserting gaps in how the >>>>> buffer is mapped to device memory, such that the buffer is no longer >>>>> contiguous in dma address space. >>>> >>>> >>>> >>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly >>>> relies on doesn't guarantee that behaviour - if the mapping happens to reach >>>> one of the segment length/boundary limits it won't just leave a gap, it'll >>>> start an entirely new IOVA allocation which could well start at a wildly >>>> different address[0]. 
>> >> >> Could you explain segment length/boundary limits and when buffers can >> reach them? Sorry, i haven't been following all the discussions, but >> I'm not aware of any similar requirements of the IOMMU hardware I >> worked with. > > > I hope the explanation at the top makes sense - it's purely about the > requirements of the DMA master device itself, nothing to do with the IOMMU > (or lack of) in the middle. Devices with scatter-gather DMA limitations > exist, therefore the API for scatter-gather DMA is designed to represent and > respect such limitations. Yes, it makes sense, thanks for the explanation. However there also exist devices with no scatter-gather capability, but behind an IOMMU without such fancy mapping limitations. I believe we should also respect the limitation of such setups, which is the lack of support for multiple IOVA segments. >>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption >>>>> about how the DMA API is supposed to work? >>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a >>>>> buffer given as an sg_table with an arbitrary set of pages? >>>> >>>> >>>> >>>> From the Streaming DMA mappings section of Documentation/DMA-API.txt: >>>> >>>> Note also that the above constraints on physical contiguity and >>>> dma_mask may not apply if the platform has an IOMMU (a device which >>>> maps an I/O DMA address to a physical memory address). However, to >>>> be >>>> portable, device driver writers may *not* assume that such an IOMMU >>>> exists. >>>> >>>> There's not strictly any harm in using the DMA API this way and *hoping* >>>> you get what you want, as long as you're happy for it to fail pretty much >>>> 100% of the time on some systems, and still in a minority of corner cases on >>>> any system. >> >> >> Could you please elaborate? I'd like to see examples, because I can't >> really imagine buffers mappable contiguously on CPU, but not on IOMMU. >> Also, as I said, the hardware I worked with didn't suffer from >> problems like this. > > > "...device driver writers may *not* assume that such an IOMMU exists." > And this is exactly why they _should_ use dma_map_sg(), because it was supposed to work correctly for both physically contiguous (i.e. 1 segment) buffers and non-IOMMU-enabled devices, as well as with non-contiguous (i.e. > 1 segment) buffers and IOMMU-enabled devices. >>>> However, if there's a real dependency on IOMMUs and tight control of >>>> IOVA allocation here, then the DMA API isn't really the right tool for the >>>> job, and maybe it's time to start looking to how to better fit these >>>> multimedia-subsystem-type use cases into the IOMMU API - as far as I >>>> understand it there's at least some conceptual overlap with the HSA PASID >>>> stuff being prototyped in PCI/x86-land at the moment, so it could be an >>>> apposite time to try and bang out some common requirements. >> >> >> The DMA API is actually the only good tool to use here to keep the >> videobuf2-dma-contig code away from the knowledge about platform >> specific data, e.g. presence of IOMMU. The only thing it knows is that >> the target hardware requires a single contiguous buffer and it relies >> on the fact that in correct cases the buffer given to it will meet >> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable >> with IOMMU). > > > As above; the DMA API guarantees only what the DMA API guarantees. 
An > IOMMU-based implementation of streaming DMA is free to identity-map pages if > it only cares about device isolation; a non-IOMMU implementation is free to > provide streaming DMA remapping via some elaborate bounce-buffering scheme I guess this is the area where our understandings of IOMMU-backed DMA API differ. > if it really wants to. GART-type IOMMUs... let's not even go there. I believe that's how IOMMU-based implementation of DMA API was supposed to work when first implemented for ARM... > > If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to > use dma_alloc_coherent() for that, not streaming mappings. Except that it can't use it, because the buffers are already allocated by another entity. Best regards, Tomasz ^ permalink raw reply [flat|nested] 15+ messages in thread
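For reference, the two limits being debated above (maximum segment size and segment boundary mask) are the values a DMA master's driver declares through dev->dma_parms; a rough sketch, with names and limits invented for the example:

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    /*
     * Illustrative sketch only: a DMA master advertising its
     * scatter-gather limits, which dma_map_sg() implementations are
     * expected to honour. The limits below are made up.
     */
    static struct device_dma_parameters example_dma_parms;

    static int example_declare_sg_limits(struct device *dev)
    {
            dev->dma_parms = &example_dma_parms;

            /* At most 64 KiB per scatterlist entry... */
            if (dma_set_max_seg_size(dev, SZ_64K))
                    return -EIO;

            /* ...and no entry may cross a 4 GiB boundary. */
            return dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
    }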
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-04 5:12 ` Tomasz Figa @ 2015-11-04 9:27 ` Russell King - ARM Linux 2015-11-04 9:48 ` Tomasz Figa 2015-11-09 13:11 ` Robin Murphy 1 sibling, 1 reply; 15+ messages in thread From: Russell King - ARM Linux @ 2015-11-04 9:27 UTC (permalink / raw) To: Tomasz Figa Cc: Robin Murphy, Laurent Pinchart, Pawel Osciak, Catalin Marinas, Joerg Roedel, Will Deacon, Kyungmin Park, Daniel Kurtz, Yong Wu, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), linux-mediatek, Lin PoChun, thunder.leizhen, Marek Szyprowski, Yingjoe Chen, Thierry Reding, linux-arm-kernel@lists.infradead.org, linux-media On Wed, Nov 04, 2015 at 02:12:03PM +0900, Tomasz Figa wrote: > My understanding of a scatterlist was that it represents a buffer as a > whole, by joining together its physically discontinuous segments. Correct, and it may also be scattered in CPU virtual space as well. > I don't see how single segments (layout of which is completely up to > the allocator; often just single pages) would be usable for hardware > that needs to do some work more serious than just writing a byte > stream continuously to subsequent buffers. In case of such simple > devices you don't even need an IOMMU (for means other than protection > and/or getting over address space limitations). All that's required is that the addresses described in the scatterlist are accessed as an apparently contiguous series of bytes. They don't have to be contiguous in any address view, provided the device access appears to be contiguous. How that is done is really neither here nor there. IOMMUs are normally there as an address translator - for example, the underlying device may not have the capability to address a scatterlist (eg, because it makes effectively random access) and in order to be accessible to the device, it needs to be made contiguous in device address space. Another scenario is that you have more bits of physical address than a device can generate itself for DMA purposes, and you need an IOMMU to create a (possibly scattered) mapping in device address space within the ability of the device to address. The requirements here depend on the device behind the IOMMU. > However, IMHO the most important use case of an IOMMU is to make > buffers, which are contiguous in CPU virtual address space (VA), > contiguous in device's address space (IOVA). No - there is no requirement for CPU virtual contiguous buffers to also be contiguous in the device address space. -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 15+ messages in thread
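The second scenario above usually appears in a driver only as the DMA mask it declares; a minimal sketch, with the helper name assumed:

    #include <linux/dma-mapping.h>

    /*
     * Illustrative sketch only: a device that can drive just 32 address
     * bits declares that limit, and the DMA API (with or without an
     * IOMMU behind it) is then responsible for returning addresses the
     * device can actually reach.
     */
    static int example_declare_dma_mask(struct device *dev)
    {
            int ret;

            ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
            if (ret)
                    dev_warn(dev, "no suitable DMA addressing available\n");

            return ret;
    }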
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-04 9:27 ` Russell King - ARM Linux @ 2015-11-04 9:48 ` Tomasz Figa 2015-11-04 10:50 ` Russell King - ARM Linux 0 siblings, 1 reply; 15+ messages in thread From: Tomasz Figa @ 2015-11-04 9:48 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Robin Murphy, Laurent Pinchart, Pawel Osciak, Catalin Marinas, Joerg Roedel, Will Deacon, Kyungmin Park, Daniel Kurtz, Yong Wu, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), linux-mediatek, Lin PoChun, thunder.leizhen, Marek Szyprowski, Yingjoe Chen, Thierry Reding, linux-arm-kernel@lists.infradead.org, linux-media On Wed, Nov 4, 2015 at 6:27 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Nov 04, 2015 at 02:12:03PM +0900, Tomasz Figa wrote: >> My understanding of a scatterlist was that it represents a buffer as a >> whole, by joining together its physically discontinuous segments. > > Correct, and it may also be scattered in CPU virtual space as well. > >> I don't see how single segments (layout of which is completely up to >> the allocator; often just single pages) would be usable for hardware >> that needs to do some work more serious than just writing a byte >> stream continuously to subsequent buffers. In case of such simple >> devices you don't even need an IOMMU (for means other than protection >> and/or getting over address space limitations). > > All that's required is that the addresses described in the scatterlist > are accessed as an apparently contiguous series of bytes. They don't > have to be contiguous in any address view, provided the device access > appears to be contiguous. How that is done is really neither here nor > there. > > IOMMUs are normally there as an address translator - for example, the > underlying device may not have the capability to address a scatterlist > (eg, because it makes effectively random access) and in order to be > accessible to the device, it needs to be made contiguous in device > address space. > > Another scenario is that you have more bits of physical address than > a device can generate itself for DMA purposes, and you need an IOMMU > to create a (possibly scattered) mapping in device address space > within the ability of the device to address. > > The requirements here depend on the device behind the IOMMU. I fully agree with you. The problem is that the code being discussed here breaks the case of devices that don't have the capability of addressing a scatterlist, supposedly for the sake of devices that have such capability (but as I suggested, they both could be happily supported, by distinguishing special values of DMA max segment size and boundary mask). >> However, IMHO the most important use case of an IOMMU is to make >> buffers, which are contiguous in CPU virtual address space (VA), >> contiguous in device's address space (IOVA). > > No - there is no requirement for CPU virtual contiguous buffers to also > be contiguous in the device address space. There is no requirement, but shouldn't it be desired for the mapping code to map them as such? Otherwise, how could the IOMMU use case you described above (address translator for devices which don't have the capability to address a scatterlist) be handled properly? Is the general conclusion now that dma_map_sg() should not be used to create IOMMU mappings and we should make a step backwards making all drivers (or frameworks, such as videobuf2) do that manually? 
That would be really backwards, because code not aware of IOMMU existence at all would have to become aware of it. Best regards, Tomasz ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-04 9:48 ` Tomasz Figa @ 2015-11-04 10:50 ` Russell King - ARM Linux 0 siblings, 0 replies; 15+ messages in thread From: Russell King - ARM Linux @ 2015-11-04 10:50 UTC (permalink / raw) To: Tomasz Figa Cc: Robin Murphy, Laurent Pinchart, Pawel Osciak, Catalin Marinas, Joerg Roedel, Will Deacon, Kyungmin Park, Daniel Kurtz, Yong Wu, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), linux-mediatek, Lin PoChun, thunder.leizhen, Marek Szyprowski, Yingjoe Chen, Thierry Reding, linux-arm-kernel@lists.infradead.org, linux-media On Wed, Nov 04, 2015 at 06:48:50PM +0900, Tomasz Figa wrote: > There is no requirement, but shouldn't it be desired for the mapping > code to map them as such? Otherwise, how could the IOMMU use case you > described above (address translator for devices which don't have the > capability to address a scatterlist) be handled properly? It's up to the IOMMU code to respect the parameters that the device has supplied to it via the device_dma_parameters. This doesn't currently allow a device to say "I want this scatterlist to be mapped as a contiguous device address", so really if a device has such a requirement, at the moment the device driver _must_ check the dma_map_sg() return value and act accordingly. While it's possible to say "an IOMMU should map as a single contiguous address" what happens when the IOMMU's device address space becomes fragmented? > Is the general conclusion now that dma_map_sg() should not be used to > create IOMMU mappings and we should make a step backwards making all > drivers (or frameworks, such as videobuf2) do that manually? That > would be really backwards, because code not aware of IOMMU existence > at all would have to become aware of it. No. The DMA API has always had the responsibility for managing the IOMMU device, which may well be shared between multiple different devices. However, if the IOMMU is part of a device IP block (such as a GPU) then the decision on whether the DMA API should be used or not is up to the driver author. If it has special management requirements, then it's probably appropriate for the device driver to manage it by itself. For example, a GPUs MMU may need something inserted into the GPUs command stream to flush the MMU TLBs. Such cases are inappropriate to be using the DMA API for IOMMU management. -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 15+ messages in thread
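To make "respect the parameters supplied via device_dma_parameters" concrete, here is a sketch of the check an implementation has to apply to each mapped segment; the names are assumed, and the real dma_map_sg() implementations are structured differently but enforce the same two conditions:

    #include <linux/dma-mapping.h>

    /*
     * Illustrative sketch only: a segment handed back by dma_map_sg()
     * must not exceed the device's maximum segment size, and must not
     * cross its boundary mask, i.e. the address bits above the mask
     * must be identical for the first and last byte of the segment.
     */
    static bool example_segment_ok(struct device *dev,
                                   dma_addr_t addr, unsigned int len)
    {
            unsigned long boundary = dma_get_seg_boundary(dev);

            if (len > dma_get_max_seg_size(dev))
                    return false;

            return ((addr ^ (addr + len - 1)) & ~(dma_addr_t)boundary) == 0;
    }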
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-11-04 5:12 ` Tomasz Figa 2015-11-04 9:27 ` Russell King - ARM Linux @ 2015-11-09 13:11 ` Robin Murphy 1 sibling, 0 replies; 15+ messages in thread From: Robin Murphy @ 2015-11-09 13:11 UTC (permalink / raw) To: Tomasz Figa Cc: Daniel Kurtz, Lin PoChun, linux-arm-kernel@lists.infradead.org, Yingjoe Chen, Will Deacon, linux-media, Thierry Reding, open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs), Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak, Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas, Russell King, linux-mediatek On 04/11/15 05:12, Tomasz Figa wrote: > On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy <robin.murphy@arm.com> wrote: >> Hi Tomasz, >> >> On 02/11/15 13:43, Tomasz Figa wrote: >>> >>> I'd like to know what is the boundary mask and what hardware imposes >>> requirements like this. The cost here is not only over-allocating a >>> little, but making many, many buffers contiguously mappable on the >>> CPU, unmappable contiguously in IOMMU, which just defeats the purpose >>> of having an IOMMU, which I believe should be there for simple IP >>> blocks taking one DMA address to be able to view the buffer the same >>> way as the CPU. >> >> >> The expectation with dma_map_sg() is that you're either going to be >> iterating over the buffer segments, handing off each address to the device >> to process one by one; > > My understanding of a scatterlist was that it represents a buffer as a > whole, by joining together its physically discontinuous segments. It can, but there are also cases where a single scatterlist is used to batch up multiple I/O requests - see the stuff in block/blk-merge.c as described in section 2.2 of Documentation/biodoc.txt, and AFAICS anyone could quite happily use the dmaengine API, and possibly others, in the same way. Ultimately a scatterlist is no more specific than "a list of blocks of physical memory that each want giving a DMA address". > I don't see how single segments (layout of which is completely up to > the allocator; often just single pages) would be usable for hardware > that needs to do some work more serious than just writing a byte > stream continuously to subsequent buffers. In case of such simple > devices you don't even need an IOMMU (for means other than protection > and/or getting over address space limitations). > > However, IMHO the most important use case of an IOMMU is to make > buffers, which are contiguous in CPU virtual address space (VA), > contiguous in device's address space (IOVA). Your implementation of > dma_map_sg() effectively breaks this ability, so I'm not really > following why it's located under drivers/iommu and supposed to be used > with IOMMU-enabled platforms... > >> or you have a scatter-gather-capable device, in which >> case you hand off the whole list at once. > > No need for mapping ability of the IOMMU here as well (except for > working around address space issues, as I mentioned above). Ok, now I'm starting to wonder if you're wilfully choosing to miss the point. Look at 64-bit systems of any architecture, and those address space issues are pretty much the primary consideration for including an IOMMU in the first place (behind virtualisation, which we can forget about here). 
Take the Juno board on my desk - most of the peripherals cannot address 75% of the RAM, and CPU bounce buffers are both not overly efficient and a limited resource (try using dmatest with sufficiently large buffers to stress/measure memory bandwidth and watch it take down the kernel, and that's without any other SWIOTLB contention). The only one that really cares at all about contiguous buffers is the HDLCD, but that's perfectly happy when it calls dma_alloc_coherent() via drm_fb_cma_helper and pulls a contiguous 8MB framebuffer out of thin air, without even knowing that CMA itself is disabled and it couldn't natively address 75% of the memory that might be backing that buffer. That last point also illustrates that the thing for providing DMA-contiguous buffers is indeed very good at providing DMA-contiguous buffers when backed by an IOMMU. >> It's in the latter case where you >> have to make sure the list doesn't exceed the hardware limitations of that >> device. I believe the original concern was disk controllers (the >> introduction of dma_parms seems to originate from the linux-scsi list), but >> most scatter-gather engines are going to have some limit on how much they >> can handle per entry (IMO the dmaengine drivers are the easiest example to >> look at). >> >> Segment boundaries are a little more arcane, but my assumption is that they >> relate to the kind of devices whose addressing is not flat but relative to >> some separate segment register (The "64-bit" mode of USB EHCI is one >> concrete example I can think of) - since you cannot realistically change the >> segment register while the device is in the middle of accessing a single >> buffer entry, that entry must not fall across a segment boundary or at some >> point the device's accesses are going to overflow the offset address bits >> and wrap around to bogus addresses at the bottom of the segment. > > The two requirements above sound like something really specific to > scatter-gather-capable hardware, which as I pointed above, barely need > an IOMMU (at least its mapping capabilities). We are talking here > about very IOMMU-specific code, though... > > Now, while I see that on some systems there might be IOMMU used for > improving protection and working around addressing issues with > SG-capable hardware, the code shouldn't be breaking the majority of > systems with IOMMU used as the only possible way to make physically > discontinuous appear (IO-virtually) continuous to devices incapable of > scatter-gather. Unless this majority of systems are all 64-bit ARMv8 ones running code that works perfectly _with the existing SWIOTLB DMA API implementation_ but not with this implementation, then I disagree that anything is being broken that wasn't already broken with respect to portability. Otherwise, please give me the details of any regressions with these patches relative to SWIOTLB DMA on arm64 so I can look into them. >> Now yes, it will be possible under _most_ circumstances to use an IOMMU to >> lay out a list of segments with page-aligned lengths within a single IOVA >> allocation whilst still meeting all the necessary constraints. 
It just needs >> some unavoidably complicated calculations - quite likely significantly more >> complex than my v5 version of map_sg() that tried to do that and merge >> segments but failed to take the initial alignment into account properly - >> since there are much simpler ways to enforce just the _necessary_ behaviour >> for the DMA API, I put the complicated stuff to one side for now to prevent >> it holding up getting the basic functional support in place. > > Somehow just whatever currently done in arch/arm/mm/dma-mapping.c was > sufficient and not overly complicated. > > See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 . > > I can see that the code there at least tries to comply with maximum > segment size constraint. Segment boundary seems to be ignored, though. It certainly doesn't map the entire list into a single IOVA allocation as here (such that everything is laid out in contiguous IOVA pages _regardless_ of the segment lengths, and unmapping becomes nicely trivial). That it also is the only implementation which fails to respect segment boundaries really just implies that it's probably not seen much use beyond supporting graphics hardware on 32-bit systems, and/or has just got lucky otherwise. > However, I'm convinced that in most (if not all) cases where IOMMU > IOVA-contiguous mapping is needed, those two requirements don't exist. > Do we really have to break the good hardware only because the > bad^Wlimited one is broken? Where "is broken" at least encompasses "is a SATA controller", presumably. Here's an example I've actually played with: http://lxr.free-electrons.com/source/drivers/ata/sata_sil24.c#L390 It doesn't seem all that unreasonable that hardware that fundamentally works in fixed-size blocks of data wants its data aligned to its block size (or some efficient multiple). Implementing an API which has guaranteed support for that requirement from the outset necessitates supporting that requirement. I'm not going to buy the argument that having some video device DMA into userspace pages is more important than being able to boot at all (and not corrupting your filesystem). > Couldn't we preserve the ARM-like behavior whenever > dma_parms->segment_boundary_mask is set to all 1s and > dma_parms->max_segment_size to UINT_MAX (what currently drivers used > to set) or 0 (sounds more logical for the meaning of "no maximum > given")? Sure, I was always aiming to ultimately improve on the arch/arm implementation (i.e. with the single allocation thing), but for a common general-purpose implementation that's going to be shared by multiple architectures, correctness comes way before optimisation for one specific use-case. Thus we start with a baseline version that we know correctly implements all the required behaviour specified by the DMA API, then start tweaking it for other considerations later. FWIW, I've already sketched out such a follow-on patch to start tightening up map_sg (because exposing any pages to the device more than absolutely necessary is not what we want in the long run). The thought that it's likely to be jumped on and used as an excuse to justify bad code elsewhere does rather sour the idea, though. >>>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of >>>>>> memory pages into a contiguous block in device memory address space. >>>>>> This would allow passing a dma mapped buffer to device dma using just >>>>>> a device address and length. >>>>> >>>>> >>>>> >>>>> Not at all. 
The streaming DMA API (dma_map_* and friends) has two >>>>> responsibilities: performing any necessary cache maintenance to ensure the >>>>> device will correctly see data from the CPU, and the CPU will correctly see >>>>> data from the device; and working out an address for that buffer from the >>>>> device's point of view to actually hand off to the hardware (which is >>>>> perfectly well allowed to fail). >>> >>> >>> Agreed. The dma_map_*() API is not guaranteed to return a single >>> contiguous part of virtual address space for any given SG list. >>> However it was understood to be able to map buffers contiguously >>> mappable by the CPU into a single segment and users, >>> videobuf2-dma-contig in particular, relied on this. >> >> >> I don't follow that - _any_ buffer made of page-sized chunks is going to be >> mappable contiguously by the CPU;' > > Yes it is. Actually the last chunk might not even need to be > page-sized. However I believe we can have a scatterlist consisting of > non-page-sized chunks in the middle as well, which is obviously not > mappable in a contiguous way even for the CPU. > >> it's clearly impossible for the streaming >> DMA API itself to offer such a guarantee, because it's entirely orthogonal >> to the presence or otherwise of an IOMMU. > > But we are talking here about the very IOMMU-specific implementation of DMA API. Exactly, therein lies the problem! The whole point of an API is that we write code against the provided _interface_, not against some particular implementation detail. To quote Raymond Chen, "I can't believe I had to write that". I fail to see how anyone would be surprised that code which is reliant on specific non-contractual behaviour of a particular API implementation is not portable to other implementations of that API. >> Furthermore, I can't see any existing dma_map_sg implementation (between >> arm/64 and x86, at least), that _won't_ break that expectation under certain >> conditions (ranging from "relatively pathological" to "always"), so it still >> seems questionable to have a dependency on it. > > The current implementation for arch/arm doesn't break that > expectation. As long as we fit inside the maximum segment size (which > in most, if not all, cases of the hardware that actually requires such > contiguous mapping to be created, is UINT_MAX). Well, yes, that just restates my point exactly; outside of certain conditions you will still get a non-contiguous mapping. Put that exact code on a 64-bit system, throw a scatterlist describing a "relatively pathological" 5GB buffer into it, and see what you get out. > http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 > >> >>>>> Consider SWIOTLB's implementation - segments which already lie at >>>>> physical addresses within the device's DMA mask just get passed through, >>>>> while those that lie outside it get mapped into the bounce buffer, but still >>>>> as individual allocations (arch code just handles cache maintenance on the >>>>> resulting physical addresses and can apply any hard-wired DMA offset for the >>>>> device concerned). >>> >>> >>> And this is fine for vb2-dma-contig, which was made for devices that >>> require buffers contiguous in its address space. Without IOMMU it will >>> allow only physically contiguous buffers and fails otherwise, which is >>> fine, because it's a hardware requirement. 
>> >> >> If it depends on having contiguous-from-the-device's-view DMA buffers either >> way, that's a sign it should perhaps be using the coherent DMA API instead, >> which _does_ give such a guarantee. I'm well aware of the "but the >> noncacheable mappings make userspace access unacceptably slow!" issue many >> folks have with that, though, and don't particularly fancy going off on that >> tangent here. > > The keywords here are DMA-BUF and user pointer. Neither of these cases > can use coherent DMA API, because the buffer is already allocated, so > it just needs to be mapped into another device's (or its IOMMU's) > address space. Obviously we can't guarantee mappability of such > buffers, e.g. in case of importing non-contiguous buffers to a device > without an IOMMU, However we expect the pipelines to be sane > (physically contiguous buffers or both devices IOMMU-enabled), so that > such things won't happen. The "guarantee to map these scatterlist pages contiguously in IOVA space if an IOMMU is present" function is named iommu_map_sg(). There is nothing in the DMA API offering that behaviour. How well does vb2-dma-contig work with the x86 IOMMUs? >>>>>> IIUC, the change above breaks this model by inserting gaps in how the >>>>>> buffer is mapped to device memory, such that the buffer is no longer >>>>>> contiguous in dma address space. >>>>> >>>>> >>>>> >>>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly >>>>> relies on doesn't guarantee that behaviour - if the mapping happens to reach >>>>> one of the segment length/boundary limits it won't just leave a gap, it'll >>>>> start an entirely new IOVA allocation which could well start at a wildly >>>>> different address[0]. >>> >>> >>> Could you explain segment length/boundary limits and when buffers can >>> reach them? Sorry, i haven't been following all the discussions, but >>> I'm not aware of any similar requirements of the IOMMU hardware I >>> worked with. >> >> >> I hope the explanation at the top makes sense - it's purely about the >> requirements of the DMA master device itself, nothing to do with the IOMMU >> (or lack of) in the middle. Devices with scatter-gather DMA limitations >> exist, therefore the API for scatter-gather DMA is designed to represent and >> respect such limitations. > > Yes, it makes sense, thanks for the explanation. However there also > exist devices with no scatter-gather capability, but behind an IOMMU > without such fancy mapping limitations. I believe we should also > respect the limitation of such setups, which is the lack of support > for multiple IOVA segments. > >>>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption >>>>>> about how the DMA API is supposed to work? >>>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a >>>>>> buffer given as an sg_table with an arbitrary set of pages? >>>>> >>>>> >>>>> >>>>> From the Streaming DMA mappings section of Documentation/DMA-API.txt: >>>>> >>>>> Note also that the above constraints on physical contiguity and >>>>> dma_mask may not apply if the platform has an IOMMU (a device which >>>>> maps an I/O DMA address to a physical memory address). However, to >>>>> be >>>>> portable, device driver writers may *not* assume that such an IOMMU >>>>> exists. 
>>>>> >>>>> There's not strictly any harm in using the DMA API this way and *hoping* >>>>> you get what you want, as long as you're happy for it to fail pretty much >>>>> 100% of the time on some systems, and still in a minority of corner cases on >>>>> any system. >>> >>> >>> Could you please elaborate? I'd like to see examples, because I can't >>> really imagine buffers mappable contiguously on CPU, but not on IOMMU. >>> Also, as I said, the hardware I worked with didn't suffer from >>> problems like this. >> >> >> "...device driver writers may *not* assume that such an IOMMU exists." >> > > And this is exactly why they _should_ use dma_map_sg(), because it was > supposed to work correctly for both physically contiguous (i.e. 1 > segment) buffers and non-IOMMU-enabled devices, as well as with > non-contiguous (i.e. > 1 segment) buffers and IOMMU-enabled devices. Note that the number of segments has nothing to do with whether they are contiguous (in any address space) or not. In fact, while I've been thinking about this I realise we have another misapprehension here: the point of dma_parms is to expose a device's scatter-gather capabilities to _restrict_ what an IOMMU-based DMA API implementation can do (see 6b7b65105522) - thus setting fake "restrictions" for non-scatter-gather hardware in an attempt to force an implementation into merging segments is entirely backwards. >>>>> However, if there's a real dependency on IOMMUs and tight control of >>>>> IOVA allocation here, then the DMA API isn't really the right tool for the >>>>> job, and maybe it's time to start looking to how to better fit these >>>>> multimedia-subsystem-type use cases into the IOMMU API - as far as I >>>>> understand it there's at least some conceptual overlap with the HSA PASID >>>>> stuff being prototyped in PCI/x86-land at the moment, so it could be an >>>>> apposite time to try and bang out some common requirements. >>> >>> >>> The DMA API is actually the only good tool to use here to keep the >>> videobuf2-dma-contig code away from the knowledge about platform >>> specific data, e.g. presence of IOMMU. The only thing it knows is that >>> the target hardware requires a single contiguous buffer and it relies >>> on the fact that in correct cases the buffer given to it will meet >>> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable >>> with IOMMU). >> >> >> As above; the DMA API guarantees only what the DMA API guarantees. An >> IOMMU-based implementation of streaming DMA is free to identity-map pages if >> it only cares about device isolation; a non-IOMMU implementation is free to >> provide streaming DMA remapping via some elaborate bounce-buffering scheme > > I guess this is the area where our understandings of IOMMU-backed DMA > API differ. The DMA API provides a hardware-independent abstraction of a set of operations for exposing kernel memory to devices. When someone calls a DMA API function, they don't get to choose the details of that abstraction, and they don't get to choose the semantics of those operations. Of course they can always go ahead and propose adding something to the API, if they really believe there's something else it needs to offer. >> if it really wants to. GART-type IOMMUs... let's not even go there. > > I believe that's how IOMMU-based implementation of DMA API was > supposed to work when first implemented for ARM... > >> If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to >> use dma_alloc_coherent() for that, not streaming mappings. 
> > Except that it can't use it, because the buffers are already allocated > by another entity. dma_alloc_coherent(... for_each_sg(.. memcpy(... Or v4l is rearchitected such that the userspace pages came from mmap()ing a guaranteed-contiguous DMA buffer in the first place. Or vb2-dma-contig is rearchitected to use the IOMMU API directly where it has an IOMMU dependency. Or someone posts a patch to extend the DMA API with a dma_try_to_map_sg_as_contiguously_as_you_can_manage() operation that doesn't even necessarily have to depend on an IOMMU... Plenty of ways to replace incorrect assumptions with reliable ones. Or to put it another way; Fast, Easy to implement, Correct: pick two. With the caveat that for upstream, one of the two _must_ be "Correct". Robin. ^ permalink raw reply [flat|nested] 15+ messages in thread
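For completeness, the option of going to the IOMMU API directly where there is a hard dependency on a single contiguous IOVA range might look roughly like the following; the fixed IOVA choice, the cache-maintenance responsibility and all names here are assumptions of the sketch, not an existing vb2 implementation:

    #include <linux/iommu.h>
    #include <linux/scatterlist.h>

    /* Made-up IOVA allocation policy: a fixed base for the example. */
    #define EXAMPLE_IOVA_BASE	0x10000000UL

    /*
     * Illustrative sketch only: iommu_map_sg() is the call that actually
     * promises a contiguous IOVA layout for a scatterlist. Bypassing the
     * DMA API like this also means the caller owns IOVA management and
     * any cache maintenance the device needs.
     */
    static dma_addr_t example_map_contiguous(struct iommu_domain *domain,
                                             struct sg_table *sgt, size_t len)
    {
            size_t mapped;

            mapped = iommu_map_sg(domain, EXAMPLE_IOVA_BASE, sgt->sgl,
                                  sgt->orig_nents, IOMMU_READ | IOMMU_WRITE);
            if (!mapped)
                    return 0;
            if (mapped < len) {
                    iommu_unmap(domain, EXAMPLE_IOVA_BASE, mapped);
                    return 0;
            }

            return EXAMPLE_IOVA_BASE;
    }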
* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping 2015-10-30 1:17 ` [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping Daniel Kurtz 2015-10-30 14:09 ` Joerg Roedel 2015-10-30 14:27 ` Robin Murphy @ 2015-11-17 12:02 ` Marek Szyprowski 2 siblings, 0 replies; 15+ messages in thread From: Marek Szyprowski @ 2015-11-17 12:02 UTC (permalink / raw) To: Daniel Kurtz, Robin Murphy, Pawel Osciak Cc: Yong Wu, Joerg Roedel, Will Deacon, Catalin Marinas, open list:IOMMU DRIVERS, linux-arm-kernel@lists.infradead.org, thunder.leizhen, Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding, Lin PoChun, Bobby Batacharia (via Google Docs), linux-media, Kyungmin Park, Tomasz Figa, Russell King - ARM Linux, Bartlomiej Zolnierkiewicz Hello, I'm really sorry do late joining this discussion, but I was terribly busy with other things. On 2015-10-30 02:17, Daniel Kurtz wrote: > +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the > v4l2-contig's usage of the DMA API. > > Hi Robin, > > On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote: >> On 26/10/15 13:44, Yong Wu wrote: >>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote: >>> [...] >>>> +/* >>>> + * The DMA API client is passing in a scatterlist which could describe >>>> + * any old buffer layout, but the IOMMU API requires everything to be >>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of >>>> + * impedance-matching, to be able to hand off a suitably-aligned list, >>>> + * but still preserve the original offsets and sizes for the caller. >>>> + */ >>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, >>>> + int nents, int prot) >>>> +{ >>>> + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); >>>> + struct iova_domain *iovad = domain->iova_cookie; >>>> + struct iova *iova; >>>> + struct scatterlist *s, *prev = NULL; >>>> + dma_addr_t dma_addr; >>>> + size_t iova_len = 0; >>>> + int i; >>>> + >>>> + /* >>>> + * Work out how much IOVA space we need, and align the segments >>>> to >>>> + * IOVA granules for the IOMMU driver to handle. With some clever >>>> + * trickery we can modify the list in-place, but reversibly, by >>>> + * hiding the original data in the as-yet-unused DMA fields. >>>> + */ >>>> + for_each_sg(sg, s, nents, i) { >>>> + size_t s_offset = iova_offset(iovad, s->offset); >>>> + size_t s_length = s->length; >>>> + >>>> + sg_dma_address(s) = s->offset; >>>> + sg_dma_len(s) = s_length; >>>> + s->offset -= s_offset; >>>> + s_length = iova_align(iovad, s_length + s_offset); >>>> + s->length = s_length; >>>> + >>>> + /* >>>> + * The simple way to avoid the rare case of a segment >>>> + * crossing the boundary mask is to pad the previous one >>>> + * to end at a naturally-aligned IOVA for this one's >>>> size, >>>> + * at the cost of potentially over-allocating a little. >>>> + */ >>>> + if (prev) { >>>> + size_t pad_len = roundup_pow_of_two(s_length); >>>> + >>>> + pad_len = (pad_len - iova_len) & (pad_len - 1); >>>> + prev->length += pad_len; >>> >>> Hi Robin, >>> While our v4l2 testing, It seems that we met a problem here. >>> Here we update prev->length again, Do we need update >>> sg_dma_len(prev) again too? >>> >>> Some function like vb2_dc_get_contiguous_size[1] always get >>> sg_dma_len(s) to compare instead of s->length. so it may break >>> unexpectedly while sg_dma_len(s) is not same with s->length. 
>> >> This is just tweaking the faked-up length that we hand off to iommu_map_sg() >> (see also the iova_align() above), to trick it into bumping this segment up >> to a suitable starting IOVA. The real length at this point is stashed in >> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so >> both will hold the same true length once we return to the caller. >> >> Yes, it does mean that if you have a list where the segment lengths are page >> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll >> still end up with a gap between the second and third segments, but that's >> fine because the DMA API offers no guarantees about what the resulting DMA >> addresses will be (consider the no-IOMMU case where they would each just be >> "mapped" to their physical address). If that breaks v4l, then it's probably >> v4l's DMA API use that needs looking at (again). > Hmm, I thought the DMA API maps a (possibly) non-contiguous set of > memory pages into a contiguous block in device memory address space. > This would allow passing a dma mapped buffer to device dma using just > a device address and length. > IIUC, the change above breaks this model by inserting gaps in how the > buffer is mapped to device memory, such that the buffer is no longer > contiguous in dma address space. > > Here is the code in question from > drivers/media/v4l2-core/videobuf2-dma-contig.c : > > static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt) > { > struct scatterlist *s; > dma_addr_t expected = sg_dma_address(sgt->sgl); > unsigned int i; > unsigned long size = 0; > > for_each_sg(sgt->sgl, s, sgt->nents, i) { > if (sg_dma_address(s) != expected) > break; > expected = sg_dma_address(s) + sg_dma_len(s); > size += sg_dma_len(s); > } > return size; > } > > > static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr, > unsigned long size, enum dma_data_direction dma_dir) > { > struct vb2_dc_conf *conf = alloc_ctx; > struct vb2_dc_buf *buf; > struct frame_vector *vec; > unsigned long offset; > int n_pages, i; > int ret = 0; > struct sg_table *sgt; > unsigned long contig_size; > unsigned long dma_align = dma_get_cache_alignment(); > DEFINE_DMA_ATTRS(attrs); > > dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs); > > buf = kzalloc(sizeof *buf, GFP_KERNEL); > buf->dma_dir = dma_dir; > > offset = vaddr & ~PAGE_MASK; > vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE); > buf->vec = vec; > n_pages = frame_vector_count(vec); > > sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); > > ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages, > offset, size, GFP_KERNEL); > > sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents, > buf->dma_dir, &attrs); > > contig_size = vb2_dc_get_contiguous_size(sgt); > if (contig_size < size) { > > <<<=== if the original buffer had sg entries that were not > aligned on the "natural" alignment for their size, the new arm64 iommu > core code inserts a 'gap' in the iommu mapping, which causes > vb2_dc_get_contiguous_size() to exit early (and return a smaller size > than expected). > > pr_err("contiguous mapping is too small %lu/%lu\n", > contig_size, size); > ret = -EFAULT; > goto fail_map_sg; > } > > > So, is the videobuf2-dma-contig.c based on an incorrect assumption > about how the DMA API is supposed to work? > Is it even possible to map a "contiguous-in-iova-range" mapping for a > buffer given as an sg_table with an arbitrary set of pages? > > Thanks for helping to move this forward. 
As the person responsible for both the dma-mapping IOMMU integration code for ARM and the videobuf2-dc subdriver, I would like to share the background behind them. This code is a result of our (Samsung R&D Institute Poland) work on mainlining drivers for the various multimedia devices found in Exynos SoCs. All of those devices can only process buffers that are contiguous in the DMA address space. This requirement was one of the fundamental reasons for using IOMMU modules. However, it turned out that there was no straightforward way of integrating it for our purposes, and some extensions to the core frameworks were needed. Our initial proposal integrated IOMMU drivers directly into the V4L2 helper code, as a separate memory-managing subdriver for videobuf2: http://www.spinics.net/lists/linux-media/msg31455.html Then I was advised to use the dma-mapping API and hide the IOMMU behind it. Allocating a buffer suitable for DMA with the IOMMU mapper enabled was easy. However, creating a contiguous mapping in DMA address space for a buffer scattered in physical memory was still a bit tricky. The only possible way was to use a scatterlist and assume that dma-mapping would do the right thing (create only one segment for the whole scatterlist). I found no other possibility. This part of the IOMMU and DMA-mapping interaction became especially problematic when buffer sharing (dma-buf) was introduced and it turned out that different drivers interpret scatterlists in different ways. Looking back, I see that I focused mainly on re-using the existing DMA-mapping API, which shouldn't have been the main goal. This resulted in a somewhat tricky way of performing one of the most common operations for existing multimedia devices. Scatterlists are also a bit over-engineered for such a simple operation as mapping scattered memory into a single contiguous DMA address space: they waste memory on parameters that are useless here, like the per-page offset and DMA address/length. Maybe it would be better if something like a page vector (or a PFN vector, to solve the problem of mapping buffers that cannot be described by pages) were introduced; operations like dma_map_vector() would make things much clearer. I can provide proof-of-concept code for further discussion if needed. Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland ^ permalink raw reply [flat|nested] 15+ messages in thread
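Purely to make the closing proposal concrete — nothing below exists in the mainline kernel, and every name is hypothetical — a PFN-vector interface along those lines might be sketched as:

    /*
     * Hypothetical sketch only: illustrates the "PFN vector" idea above.
     * A flat array of page frame numbers plus a single offset/length
     * pair describes the whole buffer, and the mapping operation returns
     * one contiguous DMA address for all of it.
     */
    struct dma_pfn_vector {
            unsigned long   *pfns;          /* one entry per page */
            unsigned int    nr_pages;
            unsigned int    offset;         /* offset into the first page */
            size_t          size;           /* total usable length */
    };

    /*
     * Would map the whole vector contiguously in the device's address
     * space, returning a single DMA address, or a DMA error value when
     * that cannot be done (e.g. no IOMMU and non-contiguous pages).
     */
    dma_addr_t dma_map_vector(struct device *dev, struct dma_pfn_vector *vec,
                              enum dma_data_direction dir);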