* [PATCH v4 0/5] Fix double allocation in swiotlb_alloc()
@ 2024-02-21 11:34 Will Deacon
2024-02-21 11:35 ` [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling Will Deacon
` (5 more replies)
0 siblings, 6 replies; 27+ messages in thread
From: Will Deacon @ 2024-02-21 11:34 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, Will Deacon, iommu, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui,
Nicolin Chen, Michael Kelley
Hi again, folks,
This is version four of the patches which I previously posted at:
v1: https://lore.kernel.org/r/20240126151956.10014-1-will@kernel.org
v2: https://lore.kernel.org/r/20240131122543.14791-1-will@kernel.org
v3: https://lore.kernel.org/r/20240205190127.20685-1-will@kernel.org
Thanks to Petr for his Reviewed-by tag on the first three.
Changes since v3 include:
- Use umax() instead of max() to fix a build warning if the first
  patch is applied to older kernels that warn about signedness
  mismatches.
- Add two new patches to the end of the series to resolve some
  additional issues with NVMe and 64KiB pages, reported by Nicolin.
  I've added them to this series because the first three patches make
  it easier to fix this problem in the SWIOTLB code.
- Add Reviewed-by tags from Petr.
Cheers,
Will
Cc: iommu@lists.linux.dev
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Michael Kelley <mhklinux@outlook.com>
--->8
Nicolin Chen (1):
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
Will Deacon (4):
swiotlb: Fix double-allocation of slots due to broken alignment
handling
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Fix alignment checks when both allocation and DMA masks are
present
drivers/iommu/dma-iommu.c | 8 ++++++++
kernel/dma/swiotlb.c | 43 +++++++++++++++++++++++++++------------
2 files changed, 38 insertions(+), 13 deletions(-)
--
2.44.0.rc0.258.g7320e95886-goog
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling 2024-02-21 11:34 [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Will Deacon @ 2024-02-21 11:35 ` Will Deacon 2024-02-21 23:35 ` Michael Kelley 2024-02-21 11:35 ` [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc() Will Deacon ` (4 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Will Deacon @ 2024-02-21 11:35 UTC (permalink / raw) To: linux-kernel Cc: kernel-team, Will Deacon, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"), which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment checks"), causes a functional regression with vsock in a virtual machine using bouncing via a restricted DMA SWIOTLB pool. When virtio allocates the virtqueues for the vsock device using dma_alloc_coherent(), the SWIOTLB search can return page-unaligned allocations if 'area->index' was left unaligned by a previous allocation from the buffer: # Final address in brackets is the SWIOTLB address returned to the caller | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800) | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800) | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800) This ends badly (typically buffer corruption and/or a hang) because swiotlb_alloc() is expecting a page-aligned allocation and so blindly returns a pointer to the 'struct page' corresponding to the allocation, therefore double-allocating the first half (2KiB slot) of the 4KiB page. Fix the problem by treating the allocation alignment separately to any additional alignment requirements from the device, using the maximum of the two as the stride to search the buffer slots and taking care to ensure a minimum of page-alignment for buffers larger than a page. 
Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix") Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks") Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Dexuan Cui <decui@microsoft.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Signed-off-by: Will Deacon <will@kernel.org> --- kernel/dma/swiotlb.c | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index b079a9a8e087..2ec2cc81f1a2 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; unsigned long max_slots = get_max_slots(boundary_mask); unsigned int iotlb_align_mask = - dma_get_min_align_mask(dev) | alloc_align_mask; + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); unsigned int nslots = nr_slots(alloc_size), stride; unsigned int offset = swiotlb_align_offset(dev, orig_addr); unsigned int index, slots_checked, count = 0, i; @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool BUG_ON(!nslots); BUG_ON(area_index >= pool->nareas); + /* + * For mappings with an alignment requirement don't bother looping to + * unaligned slots once we found an aligned one. + */ + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask)); + /* * For allocations of PAGE_SIZE or larger only look for page aligned * allocations. */ if (alloc_size >= PAGE_SIZE) - iotlb_align_mask |= ~PAGE_MASK; - iotlb_align_mask &= ~(IO_TLB_SIZE - 1); - - /* - * For mappings with an alignment requirement don't bother looping to - * unaligned slots once we found an aligned one. - */ - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1; + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1); spin_lock_irqsave(&area->lock, flags); if (unlikely(nslots > pool->area_nslabs - area->used)) @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool index = area->index; for (slots_checked = 0; slots_checked < pool->area_nslabs; ) { - slot_index = slot_base + index; + phys_addr_t tlb_addr; - if (orig_addr && - (slot_addr(tbl_dma_addr, slot_index) & - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) { + slot_index = slot_base + index; + tlb_addr = slot_addr(tbl_dma_addr, slot_index); + + if ((tlb_addr & alloc_align_mask) || + (orig_addr && (tlb_addr & iotlb_align_mask) != + (orig_addr & iotlb_align_mask))) { index = wrap_area_index(pool, index + 1); slots_checked++; continue; -- 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
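For readers tracing the arithmetic, here is a standalone userspace sketch (not kernel code) of the stride and per-slot checks this patch ends up with in swiotlb_search_pool_area(). The constants and the nr_slots()/get_max_slots() helpers are reimplemented here under assumed values (2KiB slots, 4KiB pages) so it compiles on its own:

/*
 * Standalone sketch of the stride and per-slot checks after this patch.
 * IO_TLB_SHIFT, PAGE_SHIFT and the helpers below are assumptions, not
 * the kernel definitions.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define IO_TLB_SHIFT	11
#define IO_TLB_SIZE	(1UL << IO_TLB_SHIFT)		/* 2KiB slots */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Slots needed to cover 'val' bytes, like the kernel's nr_slots() */
static unsigned long nr_slots(uint64_t val)
{
	return (val + IO_TLB_SIZE - 1) >> IO_TLB_SHIFT;
}

/* Slots spanned by an alignment mask, like get_max_slots() */
static unsigned long get_max_slots(uint64_t mask)
{
	return nr_slots(mask + 1);
}

static unsigned long pick_stride(uint64_t alloc_align_mask,
				 uint64_t iotlb_align_mask, size_t alloc_size)
{
	/* Don't loop over unaligned slots once an aligned one is found */
	unsigned long stride = get_max_slots(alloc_align_mask > iotlb_align_mask ?
					     alloc_align_mask : iotlb_align_mask);

	/* Page-sized or larger requests only look at page-aligned slots */
	if (alloc_size >= PAGE_SIZE && stride < PAGE_SHIFT - IO_TLB_SHIFT + 1)
		stride = PAGE_SHIFT - IO_TLB_SHIFT + 1;
	return stride;
}

/* The per-slot check from the patched search loop */
static bool slot_ok(uint64_t tlb_addr, uint64_t orig_addr,
		    uint64_t alloc_align_mask, uint64_t iotlb_align_mask)
{
	if (tlb_addr & alloc_align_mask)
		return false;
	if (orig_addr &&
	    (tlb_addr & iotlb_align_mask) != (orig_addr & iotlb_align_mask))
		return false;
	return true;
}

int main(void)
{
	/*
	 * An 8KiB allocation asking for page alignment via alloc_align_mask
	 * (as swiotlb_alloc() does once patch 3 is in), with no original
	 * buffer and no DMA min_align_mask.
	 */
	uint64_t alloc_align = PAGE_SIZE - 1;
	uint64_t iotlb_align = 0;

	printf("stride = %lu slots\n",
	       pick_stride(alloc_align, iotlb_align, 0x2000));		/* 2 */
	printf("slot at 0x98326800 usable? %d\n",			/* 0: not page-aligned */
	       slot_ok(0x98326800, 0, alloc_align, iotlb_align));
	printf("slot at 0x98327000 usable? %d\n",			/* 1 */
	       slot_ok(0x98327000, 0, alloc_align, iotlb_align));
	return 0;
}

Under these assumptions, the 0x98326800 address from the log above is rejected by the new alloc_align_mask check and the search moves on to a page-aligned slot.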
* RE: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling 2024-02-21 11:35 ` [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling Will Deacon @ 2024-02-21 23:35 ` Michael Kelley 2024-02-23 12:47 ` Will Deacon 0 siblings, 1 reply; 27+ messages in thread From: Michael Kelley @ 2024-02-21 23:35 UTC (permalink / raw) To: Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"), > which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment > checks"), causes a functional regression with vsock in a virtual machine > using bouncing via a restricted DMA SWIOTLB pool. > > When virtio allocates the virtqueues for the vsock device using > dma_alloc_coherent(), the SWIOTLB search can return page-unaligned > allocations if 'area->index' was left unaligned by a previous allocation > from the buffer: > > # Final address in brackets is the SWIOTLB address returned to the caller > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800) > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800) > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800) > > This ends badly (typically buffer corruption and/or a hang) because > swiotlb_alloc() is expecting a page-aligned allocation and so blindly > returns a pointer to the 'struct page' corresponding to the allocation, > therefore double-allocating the first half (2KiB slot) of the 4KiB page. > > Fix the problem by treating the allocation alignment separately to any > additional alignment requirements from the device, using the maximum > of the two as the stride to search the buffer slots and taking care > to ensure a minimum of page-alignment for buffers larger than a page. Could you also add some text that this patch fixes the scenario I described in the other email thread? Something like: The changes to page alignment handling also fix a problem when the alloc_align_mask is zero. The page alignment handling added in the two mentioned commits could force alignment to more bits in orig_addr than specified by the device's DMA min_align_mask, resulting in a larger offset. Since swiotlb_max_mapping_size() is based only on the DMA min_align_mask, that larger offset plus the requested size could exceed IO_TLB_SEGSIZE slots, and the mapping could fail when it shouldn't. 
> > Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix") > Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks") > Cc: Christoph Hellwig <hch@lst.de> > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Dexuan Cui <decui@microsoft.com> > Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> > Signed-off-by: Will Deacon <will@kernel.org> > --- > kernel/dma/swiotlb.c | 28 +++++++++++++++------------- > 1 file changed, 15 insertions(+), 13 deletions(-) > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index b079a9a8e087..2ec2cc81f1a2 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; > unsigned long max_slots = get_max_slots(boundary_mask); > unsigned int iotlb_align_mask = > - dma_get_min_align_mask(dev) | alloc_align_mask; > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); > unsigned int nslots = nr_slots(alloc_size), stride; > unsigned int offset = swiotlb_align_offset(dev, orig_addr); > unsigned int index, slots_checked, count = 0, i; > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > BUG_ON(!nslots); > BUG_ON(area_index >= pool->nareas); > > + /* > + * For mappings with an alignment requirement don't bother looping to > + * unaligned slots once we found an aligned one. > + */ > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask)); > + > /* > * For allocations of PAGE_SIZE or larger only look for page aligned > * allocations. > */ > if (alloc_size >= PAGE_SIZE) > - iotlb_align_mask |= ~PAGE_MASK; > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1); > - > - /* > - * For mappings with an alignment requirement don't bother looping to > - * unaligned slots once we found an aligned one. > - */ > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1; > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1); Is this special handling of alloc_size >= PAGE_SIZE really needed? I think the comment is somewhat inaccurate. If orig_addr is non-zero, and alloc_align_mask is zero, the requirement is for the alignment to match the DMA min_align_mask bits in orig_addr, even if the allocation is larger than a page. And with Patch 3 of this series, the swiotlb_alloc() case passes in alloc_align_mask to handle page size and larger requests. So it seems like this doesn't do anything useful unless orig_addr and alloc_align_mask are both zero, and there aren't any cases of that after this patch series. If the caller wants alignment, specify it with alloc_align_mask. > > spin_lock_irqsave(&area->lock, flags); > if (unlikely(nslots > pool->area_nslabs - area->used)) > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > index = area->index; > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) { > - slot_index = slot_base + index; > + phys_addr_t tlb_addr; > > - if (orig_addr && > - (slot_addr(tbl_dma_addr, slot_index) & > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) { > + slot_index = slot_base + index; > + tlb_addr = slot_addr(tbl_dma_addr, slot_index); > + > + if ((tlb_addr & alloc_align_mask) || > + (orig_addr && (tlb_addr & iotlb_align_mask) != > + (orig_addr & iotlb_align_mask))) { It looks like these changes will cause a mapping failure in some iommu_dma_map_page() cases that previously didn't fail. 
Everything is made right by Patch 4 of your series, but from a bisect standpoint, there will be a gap where things are worse. In [1], I think Nicolin reported a crash with just this patch applied. While the iommu_dma_map_page() case can already fail due to "too large" requests because of not setting a max mapping size, this patch can cause smaller requests to fail as well until Patch 4 gets applied. That might be problem to avoid, perhaps by merging the Patch 4 changes into this patch. Michael [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m0ec36324b17947adefc18b3ac715e1952150f89d > index = wrap_area_index(pool, index + 1); > slots_checked++; > continue; > -- > 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling 2024-02-21 23:35 ` Michael Kelley @ 2024-02-23 12:47 ` Will Deacon 2024-02-23 13:36 ` Petr Tesařík ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Will Deacon @ 2024-02-23 12:47 UTC (permalink / raw) To: Michael Kelley Cc: linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote: > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > > > Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"), > > which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment > > checks"), causes a functional regression with vsock in a virtual machine > > using bouncing via a restricted DMA SWIOTLB pool. > > > > When virtio allocates the virtqueues for the vsock device using > > dma_alloc_coherent(), the SWIOTLB search can return page-unaligned > > allocations if 'area->index' was left unaligned by a previous allocation > > from the buffer: > > > > # Final address in brackets is the SWIOTLB address returned to the caller > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > > 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800) > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > > 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800) > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > > 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800) > > > > This ends badly (typically buffer corruption and/or a hang) because > > swiotlb_alloc() is expecting a page-aligned allocation and so blindly > > returns a pointer to the 'struct page' corresponding to the allocation, > > therefore double-allocating the first half (2KiB slot) of the 4KiB page. > > > > Fix the problem by treating the allocation alignment separately to any > > additional alignment requirements from the device, using the maximum > > of the two as the stride to search the buffer slots and taking care > > to ensure a minimum of page-alignment for buffers larger than a page. > > Could you also add some text that this patch fixes the scenario I > described in the other email thread? Something like: > > The changes to page alignment handling also fix a problem when > the alloc_align_mask is zero. The page alignment handling added > in the two mentioned commits could force alignment to more bits > in orig_addr than specified by the device's DMA min_align_mask, > resulting in a larger offset. Since swiotlb_max_mapping_size() > is based only on the DMA min_align_mask, that larger offset > plus the requested size could exceed IO_TLB_SEGSIZE slots, and > the mapping could fail when it shouldn't. Thanks, Michael. I can add that in. 
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > > index b079a9a8e087..2ec2cc81f1a2 100644 > > --- a/kernel/dma/swiotlb.c > > +++ b/kernel/dma/swiotlb.c > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; > > unsigned long max_slots = get_max_slots(boundary_mask); > > unsigned int iotlb_align_mask = > > - dma_get_min_align_mask(dev) | alloc_align_mask; > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); > > unsigned int nslots = nr_slots(alloc_size), stride; > > unsigned int offset = swiotlb_align_offset(dev, orig_addr); > > unsigned int index, slots_checked, count = 0, i; > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > BUG_ON(!nslots); > > BUG_ON(area_index >= pool->nareas); > > > > + /* > > + * For mappings with an alignment requirement don't bother looping to > > + * unaligned slots once we found an aligned one. > > + */ > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask)); > > + > > /* > > * For allocations of PAGE_SIZE or larger only look for page aligned > > * allocations. > > */ > > if (alloc_size >= PAGE_SIZE) > > - iotlb_align_mask |= ~PAGE_MASK; > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1); > > - > > - /* > > - * For mappings with an alignment requirement don't bother looping to > > - * unaligned slots once we found an aligned one. > > - */ > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1; > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1); > > Is this special handling of alloc_size >= PAGE_SIZE really needed? I've been wondering that as well, but please note that this code (and the comment) are in the upstream code, so I was erring in favour of keeping that while fixing the bugs. We could have an extra patch dropping it if we can convince ourselves that it's not adding anything, though. > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and > alloc_align_mask is zero, the requirement is for the alignment to match > the DMA min_align_mask bits in orig_addr, even if the allocation is > larger than a page. And with Patch 3 of this series, the swiotlb_alloc() > case passes in alloc_align_mask to handle page size and larger requests. > So it seems like this doesn't do anything useful unless orig_addr and > alloc_align_mask are both zero, and there aren't any cases of that > after this patch series. If the caller wants alignment, specify > it with alloc_align_mask. It's an interesting observation. Presumably the intention here is to reduce the cost of the linear search, but the code originates from a time when we didn't have iotlb_align_mask or alloc_align_mask and so I tend to agree that it should probably just be dropped. I'm also not even convinced that it works properly if the initial search index ends up being 2KiB (i.e. slot) aligned -- we'll end up jumping over the page-aligned addresses! I'll add another patch to v5 which removes this check (and you've basically written the commit message for me, so thanks). 
> > spin_lock_irqsave(&area->lock, flags); > > if (unlikely(nslots > pool->area_nslabs - area->used)) > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > index = area->index; > > > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) { > > - slot_index = slot_base + index; > > + phys_addr_t tlb_addr; > > > > - if (orig_addr && > > - (slot_addr(tbl_dma_addr, slot_index) & > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) { > > + slot_index = slot_base + index; > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index); > > + > > + if ((tlb_addr & alloc_align_mask) || > > + (orig_addr && (tlb_addr & iotlb_align_mask) != > > + (orig_addr & iotlb_align_mask))) { > > It looks like these changes will cause a mapping failure in some > iommu_dma_map_page() cases that previously didn't fail. Hmm, it's really hard to tell. This code has been quite badly broken for some time, so I'm not sure how far back you have to go to find a kernel that would work properly (e.g. for Nicolin's case with 64KiB pages). > Everything is made right by Patch 4 of your series, but from a > bisect standpoint, there will be a gap where things are worse. > In [1], I think Nicolin reported a crash with just this patch applied. In Nicolin's case, I think it didn't work without the patch either, this just triggered the failure earlier. > While the iommu_dma_map_page() case can already fail due to > "too large" requests because of not setting a max mapping size, > this patch can cause smaller requests to fail as well until Patch 4 > gets applied. That might be problem to avoid, perhaps by > merging the Patch 4 changes into this patch. I'll leave this up to Christoph. Personally, I'm keen to avoid having a giant patch trying to fix all the SWIOTLB allocation issues in one go, as it will inevitably get reverted due to a corner case that we weren't able to test properly, breaking the common cases at the same time. Will ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling 2024-02-23 12:47 ` Will Deacon @ 2024-02-23 13:36 ` Petr Tesařík 2024-02-23 17:04 ` Michael Kelley 2024-02-27 15:38 ` Christoph Hellwig 2 siblings, 0 replies; 27+ messages in thread From: Petr Tesařík @ 2024-02-23 13:36 UTC (permalink / raw) To: Will Deacon Cc: Michael Kelley, linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen On Fri, 23 Feb 2024 12:47:43 +0000 Will Deacon <will@kernel.org> wrote: > On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote: > > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > > > > > Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"), > > > which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment > > > checks"), causes a functional regression with vsock in a virtual machine > > > using bouncing via a restricted DMA SWIOTLB pool. > > > > > > When virtio allocates the virtqueues for the vsock device using > > > dma_alloc_coherent(), the SWIOTLB search can return page-unaligned > > > allocations if 'area->index' was left unaligned by a previous allocation > > > from the buffer: > > > > > > # Final address in brackets is the SWIOTLB address returned to the caller > > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > > > 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800) > > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > > > 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800) > > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask > > > 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800) > > > > > > This ends badly (typically buffer corruption and/or a hang) because > > > swiotlb_alloc() is expecting a page-aligned allocation and so blindly > > > returns a pointer to the 'struct page' corresponding to the allocation, > > > therefore double-allocating the first half (2KiB slot) of the 4KiB page. > > > > > > Fix the problem by treating the allocation alignment separately to any > > > additional alignment requirements from the device, using the maximum > > > of the two as the stride to search the buffer slots and taking care > > > to ensure a minimum of page-alignment for buffers larger than a page. > > > > Could you also add some text that this patch fixes the scenario I > > described in the other email thread? Something like: > > > > The changes to page alignment handling also fix a problem when > > the alloc_align_mask is zero. The page alignment handling added > > in the two mentioned commits could force alignment to more bits > > in orig_addr than specified by the device's DMA min_align_mask, > > resulting in a larger offset. Since swiotlb_max_mapping_size() > > is based only on the DMA min_align_mask, that larger offset > > plus the requested size could exceed IO_TLB_SEGSIZE slots, and > > the mapping could fail when it shouldn't. > > Thanks, Michael. I can add that in. 
> > > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > > > index b079a9a8e087..2ec2cc81f1a2 100644 > > > --- a/kernel/dma/swiotlb.c > > > +++ b/kernel/dma/swiotlb.c > > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; > > > unsigned long max_slots = get_max_slots(boundary_mask); > > > unsigned int iotlb_align_mask = > > > - dma_get_min_align_mask(dev) | alloc_align_mask; > > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); > > > unsigned int nslots = nr_slots(alloc_size), stride; > > > unsigned int offset = swiotlb_align_offset(dev, orig_addr); > > > unsigned int index, slots_checked, count = 0, i; > > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > > BUG_ON(!nslots); > > > BUG_ON(area_index >= pool->nareas); > > > > > > + /* > > > + * For mappings with an alignment requirement don't bother looping to > > > + * unaligned slots once we found an aligned one. > > > + */ > > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask)); > > > + > > > /* > > > * For allocations of PAGE_SIZE or larger only look for page aligned > > > * allocations. > > > */ > > > if (alloc_size >= PAGE_SIZE) > > > - iotlb_align_mask |= ~PAGE_MASK; > > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1); > > > - > > > - /* > > > - * For mappings with an alignment requirement don't bother looping to > > > - * unaligned slots once we found an aligned one. > > > - */ > > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1; > > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1); > > > > Is this special handling of alloc_size >= PAGE_SIZE really needed? > > I've been wondering that as well, but please note that this code (and the > comment) are in the upstream code, so I was erring in favour of keeping > that while fixing the bugs. We could have an extra patch dropping it if > we can convince ourselves that it's not adding anything, though. > > > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and > > alloc_align_mask is zero, the requirement is for the alignment to match > > the DMA min_align_mask bits in orig_addr, even if the allocation is > > larger than a page. And with Patch 3 of this series, the swiotlb_alloc() > > case passes in alloc_align_mask to handle page size and larger requests. > > So it seems like this doesn't do anything useful unless orig_addr and > > alloc_align_mask are both zero, and there aren't any cases of that > > after this patch series. If the caller wants alignment, specify > > it with alloc_align_mask. > > It's an interesting observation. Presumably the intention here is to > reduce the cost of the linear search, but the code originates from a > time when we didn't have iotlb_align_mask or alloc_align_mask and so I > tend to agree that it should probably just be dropped. I'm also not even > convinced that it works properly if the initial search index ends up > being 2KiB (i.e. slot) aligned -- we'll end up jumping over the > page-aligned addresses! Originally, SWIOTLB was not used for allocations, so orig_addr was never zero. The assumption was that if the bounce buffer should be page-aligned, then the original buffer was also page-aligned, and the check against iotlb_align_mask was sufficient. > I'll add another patch to v5 which removes this check (and you've basically > written the commit message for me, so thanks). 
> > > > spin_lock_irqsave(&area->lock, flags); > > > if (unlikely(nslots > pool->area_nslabs - area->used)) > > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > > index = area->index; > > > > > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) { > > > - slot_index = slot_base + index; > > > + phys_addr_t tlb_addr; > > > > > > - if (orig_addr && > > > - (slot_addr(tbl_dma_addr, slot_index) & > > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) { > > > + slot_index = slot_base + index; > > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index); > > > + > > > + if ((tlb_addr & alloc_align_mask) || > > > + (orig_addr && (tlb_addr & iotlb_align_mask) != > > > + (orig_addr & iotlb_align_mask))) { > > > > It looks like these changes will cause a mapping failure in some > > iommu_dma_map_page() cases that previously didn't fail. > > Hmm, it's really hard to tell. This code has been quite badly broken for > some time, so I'm not sure how far back you have to go to find a kernel > that would work properly (e.g. for Nicolin's case with 64KiB pages). I believe it fails exactly in the cases that previously found an incorrectly aligned bounce buffer. In any case, the "middle" bits (low bits but ignoring offset inside a slot) of tlb_addr should indeed correspond to the middle bits of orig_addr. > > > Everything is made right by Patch 4 of your series, but from a > > bisect standpoint, there will be a gap where things are worse. > > In [1], I think Nicolin reported a crash with just this patch applied. > > In Nicolin's case, I think it didn't work without the patch either, this > just triggered the failure earlier. > > > While the iommu_dma_map_page() case can already fail due to > > "too large" requests because of not setting a max mapping size, > > this patch can cause smaller requests to fail as well until Patch 4 > > gets applied. That might be problem to avoid, perhaps by > > merging the Patch 4 changes into this patch. > > I'll leave this up to Christoph. Personally, I'm keen to avoid having > a giant patch trying to fix all the SWIOTLB allocation issues in one go, > as it will inevitably get reverted due to a corner case that we weren't > able to test properly, breaking the common cases at the same time. I tend to think that more patches are better, even though this patch alone does introduce some regressions. Petr T ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling 2024-02-23 12:47 ` Will Deacon 2024-02-23 13:36 ` Petr Tesařík @ 2024-02-23 17:04 ` Michael Kelley 2024-02-27 15:38 ` Christoph Hellwig 2 siblings, 0 replies; 27+ messages in thread From: Michael Kelley @ 2024-02-23 17:04 UTC (permalink / raw) To: Will Deacon Cc: linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Will Deacon <will@kernel.org> Sent: Friday, February 23, 2024 4:48 AM > On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote: > > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM [snip] > > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > > > index b079a9a8e087..2ec2cc81f1a2 100644 > > > --- a/kernel/dma/swiotlb.c > > > +++ b/kernel/dma/swiotlb.c > > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; > > > unsigned long max_slots = get_max_slots(boundary_mask); > > > unsigned int iotlb_align_mask = > > > - dma_get_min_align_mask(dev) | alloc_align_mask; > > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); > > > unsigned int nslots = nr_slots(alloc_size), stride; > > > unsigned int offset = swiotlb_align_offset(dev, orig_addr); > > > unsigned int index, slots_checked, count = 0, i; > > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > > BUG_ON(!nslots); > > > BUG_ON(area_index >= pool->nareas); > > > > > > + /* > > > + * For mappings with an alignment requirement don't bother looping to > > > + * unaligned slots once we found an aligned one. > > > + */ > > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask)); > > > + > > > /* > > > * For allocations of PAGE_SIZE or larger only look for page aligned > > > * allocations. > > > */ > > > if (alloc_size >= PAGE_SIZE) > > > - iotlb_align_mask |= ~PAGE_MASK; > > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1); > > > - > > > - /* > > > - * For mappings with an alignment requirement don't bother looping to > > > - * unaligned slots once we found an aligned one. > > > - */ > > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1; > > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1); > > > > Is this special handling of alloc_size >= PAGE_SIZE really needed? > > I've been wondering that as well, but please note that this code (and the > comment) are in the upstream code, so I was erring in favour of keeping > that while fixing the bugs. We could have an extra patch dropping it if > we can convince ourselves that it's not adding anything, though. > > > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and > > alloc_align_mask is zero, the requirement is for the alignment to match > > the DMA min_align_mask bits in orig_addr, even if the allocation is > > larger than a page. And with Patch 3 of this series, the swiotlb_alloc() > > case passes in alloc_align_mask to handle page size and larger requests. > > So it seems like this doesn't do anything useful unless orig_addr and > > alloc_align_mask are both zero, and there aren't any cases of that > > after this patch series. If the caller wants alignment, specify > > it with alloc_align_mask. > > It's an interesting observation. 
Presumably the intention here is to > reduce the cost of the linear search, but the code originates from a > time when we didn't have iotlb_align_mask or alloc_align_mask and so I > tend to agree that it should probably just be dropped. I'm also not even > convinced that it works properly if the initial search index ends up > being 2KiB (i.e. slot) aligned -- we'll end up jumping over the > page-aligned addresses! > > I'll add another patch to v5 which removes this check (and you've basically > written the commit message for me, so thanks). Works for me. > > > > spin_lock_irqsave(&area->lock, flags); > > > if (unlikely(nslots > pool->area_nslabs - area->used)) > > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > > > index = area->index; > > > > > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) { > > > - slot_index = slot_base + index; > > > + phys_addr_t tlb_addr; > > > > > > - if (orig_addr && > > > - (slot_addr(tbl_dma_addr, slot_index) & > > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) { > > > + slot_index = slot_base + index; > > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index); > > > + > > > + if ((tlb_addr & alloc_align_mask) || > > > + (orig_addr && (tlb_addr & iotlb_align_mask) != > > > + (orig_addr & iotlb_align_mask))) { > > > > It looks like these changes will cause a mapping failure in some > > iommu_dma_map_page() cases that previously didn't fail. > > Hmm, it's really hard to tell. This code has been quite badly broken for > some time, so I'm not sure how far back you have to go to find a kernel > that would work properly (e.g. for Nicolin's case with 64KiB pages). > > > Everything is made right by Patch 4 of your series, but from a > > bisect standpoint, there will be a gap where things are worse. > > In [1], I think Nicolin reported a crash with just this patch applied. > > In Nicolin's case, I think it didn't work without the patch either, this > just triggered the failure earlier. > > > While the iommu_dma_map_page() case can already fail due to > > "too large" requests because of not setting a max mapping size, > > this patch can cause smaller requests to fail as well until Patch 4 > > gets applied. That might be problem to avoid, perhaps by > > merging the Patch 4 changes into this patch. > > I'll leave this up to Christoph. Personally, I'm keen to avoid having > a giant patch trying to fix all the SWIOTLB allocation issues in one go, > as it will inevitably get reverted due to a corner case that we weren't > able to test properly, breaking the common cases at the same time. > Yes, I agree there's a tradeoff against cramming all the changes into one big patch, so I'm OK with whichever approach is taken. FWIW, here is the case I'm concerned about being broken after this patch, but before Patch 4 of the series: * alloc_align_mask is 0xFFFF (e.g., due to 64K IOMMU granule) * iotlb_align_mask is 0x800 (DMA min_align_mask is 4K - 1, as for NVMe) * orig_addr is non-NULL and has bit 0x800 set In the new "if" statement, any tlb_addr that produces "false" for the left half of the "||" operator produces "true" for the right half. So the entire "if" statement always evaluates to true and the "for" loop never finds any slots that can be used. In other words, for this case there's no way for the returned swiotlb memory to be aligned to alloc_align_mask and to orig_addr (modulo DMA min_align_mask) at the same time, and the mapping fails. 
Michael ^ permalink raw reply [flat|nested] 27+ messages in thread
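The mutual exclusion Michael describes is easy to reproduce with a few lines of standalone C; the mask and address values are the ones from his example, while everything else (pool size, 2KiB slot granularity) is assumed:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t alloc_align_mask = 0xFFFF;   /* 64KiB IOVA granule */
	const uint64_t iotlb_align_mask = 0x800;    /* min_align_mask & ~(IO_TLB_SIZE - 1) */
	const uint64_t orig_addr = 0x12345800;      /* bit 0x800 set in the original buffer */
	bool found = false;

	/* Walk 2KiB slots across an assumed 1MiB pool, applying the patch-1 checks */
	for (uint64_t tlb_addr = 0; tlb_addr < (1 << 20); tlb_addr += 0x800) {
		if (tlb_addr & alloc_align_mask)
			continue;	/* fails the 64KiB alignment check */
		if ((tlb_addr & iotlb_align_mask) != (orig_addr & iotlb_align_mask))
			continue;	/* bit 0x800 is always clear here, so this never matches */
		found = true;
		break;
	}
	printf("usable slot found: %s\n", found ? "yes" : "no");	/* prints "no" */
	return 0;
}

Only once patch 4 strips the alloc_align_mask bits out of iotlb_align_mask do the two conditions stop contradicting each other.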
* Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling 2024-02-23 12:47 ` Will Deacon 2024-02-23 13:36 ` Petr Tesařík 2024-02-23 17:04 ` Michael Kelley @ 2024-02-27 15:38 ` Christoph Hellwig 2 siblings, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2024-02-27 15:38 UTC (permalink / raw) To: Will Deacon Cc: Michael Kelley, linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen On Fri, Feb 23, 2024 at 12:47:43PM +0000, Will Deacon wrote: > > > /* > > > * For allocations of PAGE_SIZE or larger only look for page aligned > > > * allocations. > > > */ > > > if (alloc_size >= PAGE_SIZE) > > > - iotlb_align_mask |= ~PAGE_MASK; > > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1); > > > - > > > - /* > > > - * For mappings with an alignment requirement don't bother looping to > > > - * unaligned slots once we found an aligned one. > > > - */ > > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1; > > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1); > > > > Is this special handling of alloc_size >= PAGE_SIZE really needed? > > I've been wondering that as well, but please note that this code (and the > comment) are in the upstream code, so I was erring in favour of keeping > that while fixing the bugs. We could have an extra patch dropping it if > we can convince ourselves that it's not adding anything, though. This special casing goes back to before git history. It obviously is not needed, but it might have made sense back then. If people come up with a good argument I'm totally fine with removing it, but I also think we need to get the fixes here in ASAP, so things that are just cleanups probably aren't priority right now. > > While the iommu_dma_map_page() case can already fail due to > > "too large" requests because of not setting a max mapping size, > > this patch can cause smaller requests to fail as well until Patch 4 > > gets applied. That might be problem to avoid, perhaps by > > merging the Patch 4 changes into this patch. > > I'll leave this up to Christoph. Personally, I'm keen to avoid having > a giant patch trying to fix all the SWIOTLB allocation issues in one go, > as it will inevitably get reverted due to a corner case that we weren't > able to test properly, breaking the common cases at the same time. Let's keep it split. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc() 2024-02-21 11:34 [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Will Deacon 2024-02-21 11:35 ` [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling Will Deacon @ 2024-02-21 11:35 ` Will Deacon 2024-02-21 23:36 ` Michael Kelley 2024-02-21 11:35 ` [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() " Will Deacon ` (3 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Will Deacon @ 2024-02-21 11:35 UTC (permalink / raw) To: linux-kernel Cc: kernel-team, Will Deacon, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley When allocating pages from a restricted DMA pool in swiotlb_alloc(), the buffer address is blindly converted to a 'struct page *' that is returned to the caller. In the unlikely event of an allocation bug, page-unaligned addresses are not detected and slots can silently be double-allocated. Add a simple check of the buffer alignment in swiotlb_alloc() to make debugging a little easier if something has gone wonky. Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Dexuan Cui <decui@microsoft.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Signed-off-by: Will Deacon <will@kernel.org> --- kernel/dma/swiotlb.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 2ec2cc81f1a2..ab7fbb40bc55 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -1643,6 +1643,12 @@ struct page *swiotlb_alloc(struct device *dev, size_t size) return NULL; tlb_addr = slot_addr(pool->start, index); + if (unlikely(!PAGE_ALIGNED(tlb_addr))) { + dev_WARN_ONCE(dev, 1, "Cannot allocate pages from non page-aligned swiotlb addr 0x%pa.\n", + &tlb_addr); + swiotlb_release_slots(dev, tlb_addr); + return NULL; + } return pfn_to_page(PFN_DOWN(tlb_addr)); } -- 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
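To see why the new check is worth having: without it, pfn_to_page(PFN_DOWN(tlb_addr)) silently rounds a misaligned bounce address down to its containing page. A userspace illustration, assuming 4KiB pages and modelling PFN_DOWN() as a plain shift, using the unaligned address from patch 1's log:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12			/* assumed 4KiB pages */
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)

int main(void)
{
	uint64_t tlb_addr = 0x98326800;		/* slot-aligned, not page-aligned */
	uint64_t page_base = (tlb_addr >> PAGE_SHIFT) << PAGE_SHIFT;

	printf("tlb_addr 0x%llx -> page base 0x%llx (offset 0x%llx silently dropped)\n",
	       (unsigned long long)tlb_addr,
	       (unsigned long long)page_base,
	       (unsigned long long)(tlb_addr & (PAGE_SIZE - 1)));
	return 0;
}

The dropped 2KiB is exactly the half-page that can then be handed out twice, which is what the dev_WARN_ONCE() now catches.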
* RE: [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc() 2024-02-21 11:35 ` [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc() Will Deacon @ 2024-02-21 23:36 ` Michael Kelley 0 siblings, 0 replies; 27+ messages in thread From: Michael Kelley @ 2024-02-21 23:36 UTC (permalink / raw) To: Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > When allocating pages from a restricted DMA pool in swiotlb_alloc(), > the buffer address is blindly converted to a 'struct page *' that is > returned to the caller. In the unlikely event of an allocation bug, > page-unaligned addresses are not detected and slots can silently be > double-allocated. > > Add a simple check of the buffer alignment in swiotlb_alloc() to make > debugging a little easier if something has gone wonky. > > Cc: Christoph Hellwig <hch@lst.de> > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Dexuan Cui <decui@microsoft.com> > Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> > Signed-off-by: Will Deacon <will@kernel.org> > --- > kernel/dma/swiotlb.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index 2ec2cc81f1a2..ab7fbb40bc55 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -1643,6 +1643,12 @@ struct page *swiotlb_alloc(struct device *dev, > size_t size) > return NULL; > > tlb_addr = slot_addr(pool->start, index); > + if (unlikely(!PAGE_ALIGNED(tlb_addr))) { > + dev_WARN_ONCE(dev, 1, "Cannot allocate pages from non page-aligned swiotlb addr 0x%pa.\n", > + &tlb_addr); > + swiotlb_release_slots(dev, tlb_addr); > + return NULL; > + } > > return pfn_to_page(PFN_DOWN(tlb_addr)); > } > -- > 2.44.0.rc0.258.g7320e95886-goog Reviewed-by: Michael Kelley <mhklinux@outlook.com> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() 2024-02-21 11:34 [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Will Deacon 2024-02-21 11:35 ` [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling Will Deacon 2024-02-21 11:35 ` [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc() Will Deacon @ 2024-02-21 11:35 ` Will Deacon 2024-02-21 23:36 ` Michael Kelley 2024-02-21 11:35 ` [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present Will Deacon ` (2 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Will Deacon @ 2024-02-21 11:35 UTC (permalink / raw) To: linux-kernel Cc: kernel-team, Will Deacon, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley core-api/dma-api-howto.rst states the following properties of dma_alloc_coherent(): | The CPU virtual address and the DMA address are both guaranteed to | be aligned to the smallest PAGE_SIZE order which is greater than or | equal to the requested size. However, swiotlb_alloc() passes zero for the 'alloc_align_mask' parameter of swiotlb_find_slots() and so this property is not upheld. Instead, allocations larger than a page are aligned to PAGE_SIZE, Calculate the mask corresponding to the page order suitable for holding the allocation and pass that to swiotlb_find_slots(). Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Dexuan Cui <decui@microsoft.com> Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers") Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Signed-off-by: Will Deacon <will@kernel.org> --- kernel/dma/swiotlb.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index ab7fbb40bc55..c20324fba814 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -1633,12 +1633,14 @@ struct page *swiotlb_alloc(struct device *dev, size_t size) struct io_tlb_mem *mem = dev->dma_io_tlb_mem; struct io_tlb_pool *pool; phys_addr_t tlb_addr; + unsigned int align; int index; if (!mem) return NULL; - index = swiotlb_find_slots(dev, 0, size, 0, &pool); + align = (1 << (get_order(size) + PAGE_SHIFT)) - 1; + index = swiotlb_find_slots(dev, 0, size, align, &pool); if (index == -1) return NULL; -- 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
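For reference, here is the alloc_align_mask that the one-line change computes for a few request sizes, using a userspace stand-in for get_order() and an assumed 4KiB PAGE_SIZE:

#include <stddef.h>
#include <stdio.h>

#define PAGE_SHIFT	12	/* assumed 4KiB pages */

/* Userspace stand-in for the kernel's get_order() */
static unsigned int get_order(size_t size)
{
	unsigned int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

int main(void)
{
	size_t sizes[] = { 0x1000, 0x2000, 0x3000, 0x10000 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned int align = (1u << (get_order(sizes[i]) + PAGE_SHIFT)) - 1;

		printf("size 0x%zx -> alloc_align_mask 0x%x\n", sizes[i], align);
	}
	return 0;
}

So a 12KiB request is aligned to the next power-of-two order (16KiB) rather than just to PAGE_SIZE, matching the dma-api-howto guarantee quoted in the commit message.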
* RE: [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() 2024-02-21 11:35 ` [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() " Will Deacon @ 2024-02-21 23:36 ` Michael Kelley 0 siblings, 0 replies; 27+ messages in thread From: Michael Kelley @ 2024-02-21 23:36 UTC (permalink / raw) To: Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > core-api/dma-api-howto.rst states the following properties of > dma_alloc_coherent(): > > | The CPU virtual address and the DMA address are both guaranteed to > | be aligned to the smallest PAGE_SIZE order which is greater than or > | equal to the requested size. > > However, swiotlb_alloc() passes zero for the 'alloc_align_mask' > parameter of swiotlb_find_slots() and so this property is not upheld. > Instead, allocations larger than a page are aligned to PAGE_SIZE, > > Calculate the mask corresponding to the page order suitable for holding > the allocation and pass that to swiotlb_find_slots(). > > Cc: Christoph Hellwig <hch@lst.de> > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Dexuan Cui <decui@microsoft.com> > Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers") > Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> > Signed-off-by: Will Deacon <will@kernel.org> > --- > kernel/dma/swiotlb.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index ab7fbb40bc55..c20324fba814 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -1633,12 +1633,14 @@ struct page *swiotlb_alloc(struct device *dev, > size_t size) > struct io_tlb_mem *mem = dev->dma_io_tlb_mem; > struct io_tlb_pool *pool; > phys_addr_t tlb_addr; > + unsigned int align; > int index; > > if (!mem) > return NULL; > > - index = swiotlb_find_slots(dev, 0, size, 0, &pool); > + align = (1 << (get_order(size) + PAGE_SHIFT)) - 1; > + index = swiotlb_find_slots(dev, 0, size, align, &pool); > if (index == -1) > return NULL; > > -- > 2.44.0.rc0.258.g7320e95886-goog Reviewed-by: Michael Kelley <mhklinux@outlook.com> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present 2024-02-21 11:34 [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Will Deacon ` (2 preceding siblings ...) 2024-02-21 11:35 ` [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() " Will Deacon @ 2024-02-21 11:35 ` Will Deacon 2024-02-21 23:37 ` Michael Kelley 2024-02-21 11:35 ` [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device Will Deacon 2024-02-23 11:34 ` [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Nicolin Chen 5 siblings, 1 reply; 27+ messages in thread From: Will Deacon @ 2024-02-21 11:35 UTC (permalink / raw) To: linux-kernel Cc: kernel-team, Will Deacon, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley Nicolin reports that swiotlb buffer allocations fail for an NVME device behind an IOMMU using 64KiB pages. This is because we end up with a minimum allocation alignment of 64KiB (for the IOMMU to map the buffer safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME page (i.e. preserving the 4KiB page offset from the original allocation). If the original address is not 4KiB-aligned, the allocation will fail because swiotlb_search_pool_area() erroneously compares these unmasked bits with the 64KiB-aligned candidate allocation. Tweak swiotlb_search_pool_area() so that the DMA alignment mask is reduced based on the required alignment of the allocation. Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") Reported-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/cover.1707851466.git.nicolinc@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> --- kernel/dma/swiotlb.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index c20324fba814..c381a7ed718f 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; unsigned long max_slots = get_max_slots(boundary_mask); - unsigned int iotlb_align_mask = - dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); + unsigned int iotlb_align_mask = dma_get_min_align_mask(dev); unsigned int nslots = nr_slots(alloc_size), stride; unsigned int offset = swiotlb_align_offset(dev, orig_addr); unsigned int index, slots_checked, count = 0, i; @@ -993,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool BUG_ON(!nslots); BUG_ON(area_index >= pool->nareas); + /* + * Ensure that the allocation is at least slot-aligned and update + * 'iotlb_align_mask' to ignore bits that will be preserved when + * offsetting into the allocation. + */ + alloc_align_mask |= (IO_TLB_SIZE - 1); + iotlb_align_mask &= ~alloc_align_mask; + /* * For mappings with an alignment requirement don't bother looping to * unaligned slots once we found an aligned one. -- 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
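The effect of the two new mask lines on the reported configuration, sketched in standalone C (the 64KiB IOVA granule and the NVME min_align_mask of 4KiB - 1 come from the report; the 2KiB IO_TLB_SIZE is an assumption):

#include <stdint.h>
#include <stdio.h>

#define IO_TLB_SIZE	0x800	/* assumed 2KiB slot size */

int main(void)
{
	uint64_t alloc_align_mask = 0xFFFF;	/* 64KiB IOMMU granule */
	uint64_t iotlb_align_mask = 0xFFF;	/* NVME DMA min_align_mask */

	/* The two lines added by this patch */
	alloc_align_mask |= (IO_TLB_SIZE - 1);
	iotlb_align_mask &= ~alloc_align_mask;

	printf("alloc_align_mask=0x%llx iotlb_align_mask=0x%llx\n",
	       (unsigned long long)alloc_align_mask,
	       (unsigned long long)iotlb_align_mask);	/* 0xffff and 0x0 */
	return 0;
}

With iotlb_align_mask reduced to zero, the per-slot check no longer requires bits that the 64KiB alignment already fixes to also match orig_addr, so the search can succeed again.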
* RE: [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present 2024-02-21 11:35 ` [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present Will Deacon @ 2024-02-21 23:37 ` Michael Kelley 0 siblings, 0 replies; 27+ messages in thread From: Michael Kelley @ 2024-02-21 23:37 UTC (permalink / raw) To: Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > Nicolin reports that swiotlb buffer allocations fail for an NVME device > behind an IOMMU using 64KiB pages. This is because we end up with a > minimum allocation alignment of 64KiB (for the IOMMU to map the buffer > safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME > page (i.e. preserving the 4KiB page offset from the original allocation). > If the original address is not 4KiB-aligned, the allocation will fail > because swiotlb_search_pool_area() erroneously compares these unmasked > bits with the 64KiB-aligned candidate allocation. > > Tweak swiotlb_search_pool_area() so that the DMA alignment mask is > reduced based on the required alignment of the allocation. > > Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") > Reported-by: Nicolin Chen <nicolinc@nvidia.com> > Link: https://lore.kernel.org/all/cover.1707851466.git.nicolinc@nvidia.com/ > Signed-off-by: Will Deacon <will@kernel.org> > --- > kernel/dma/swiotlb.c | 11 +++++++++-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index c20324fba814..c381a7ed718f 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device > *dev, struct io_tlb_pool *pool > dma_addr_t tbl_dma_addr = > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask; > unsigned long max_slots = get_max_slots(boundary_mask); > - unsigned int iotlb_align_mask = > - dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1); > + unsigned int iotlb_align_mask = dma_get_min_align_mask(dev); > unsigned int nslots = nr_slots(alloc_size), stride; > unsigned int offset = swiotlb_align_offset(dev, orig_addr); > unsigned int index, slots_checked, count = 0, i; > @@ -993,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool > BUG_ON(!nslots); > BUG_ON(area_index >= pool->nareas); > > + /* > + * Ensure that the allocation is at least slot-aligned and update > + * 'iotlb_align_mask' to ignore bits that will be preserved when > + * offsetting into the allocation. > + */ > + alloc_align_mask |= (IO_TLB_SIZE - 1); > + iotlb_align_mask &= ~alloc_align_mask; > + > /* > * For mappings with an alignment requirement don't bother looping to > * unaligned slots once we found an aligned one. > -- > 2.44.0.rc0.258.g7320e95886-goog Reviewed-by: Michael Kelley <mhklinux@outlook.com> But see my comments in Patch 1 of the series about whether this should be folded into Patch 1. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-21 11:34 [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Will Deacon ` (3 preceding siblings ...) 2024-02-21 11:35 ` [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present Will Deacon @ 2024-02-21 11:35 ` Will Deacon 2024-02-21 23:39 ` Michael Kelley ` (2 more replies) 2024-02-23 11:34 ` [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Nicolin Chen 5 siblings, 3 replies; 27+ messages in thread From: Will Deacon @ 2024-02-21 11:35 UTC (permalink / raw) To: linux-kernel Cc: kernel-team, Will Deacon, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley From: Nicolin Chen <nicolinc@nvidia.com> The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that an NVME device can map a size between 300KB~512KB, which certainly failed the swiotlb mappings, though the default pool of swiotlb has many slots: systemd[1]: Started Journal Service. => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots) note: journal-offline[392] exited with irqs disabled note: journal-offline[392] exited with preempt_count 1 Call trace: [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 [ 3.099921] iommu_dma_map_page+0x218/0x328 [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 [ 3.103463] __submit_bio+0x44/0x220 [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 [ 3.103470] submit_bio_noacct+0x180/0x6c8 [ 3.103471] submit_bio+0x34/0x130 [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 [ 3.104766] mpage_submit_folio+0xa0/0x100 [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 [ 3.104771] ext4_do_writepages+0x6a0/0xd78 [ 3.105615] ext4_writepages+0x80/0x118 [ 3.105616] do_writepages+0x90/0x1e8 [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 [ 3.106656] file_write_and_wait_range+0x84/0x120 [ 3.106658] ext4_sync_file+0x7c/0x4c0 [ 3.106660] vfs_fsync_range+0x3c/0xa8 [ 3.106663] do_fsync+0x44/0xc0 Since untrusted devices might go down the swiotlb pathway with dma-iommu, these devices should not map a size larger than swiotlb_max_mapping_size. To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from the iommu_dma_opt_mapping_size(). 
Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> --- drivers/iommu/dma-iommu.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 50ccc4f1ef81..7d1a20da6d94 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1706,6 +1706,13 @@ static size_t iommu_dma_opt_mapping_size(void) return iova_rcache_range(); } +static size_t iommu_dma_max_mapping_size(struct device *dev) +{ + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) + return swiotlb_max_mapping_size(dev); + return SIZE_MAX; +} + static const struct dma_map_ops iommu_dma_ops = { .flags = DMA_F_PCI_P2PDMA_SUPPORTED, .alloc = iommu_dma_alloc, @@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = { .unmap_resource = iommu_dma_unmap_resource, .get_merge_boundary = iommu_dma_get_merge_boundary, .opt_mapping_size = iommu_dma_opt_mapping_size, + .max_mapping_size = iommu_dma_max_mapping_size, }; /* -- 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
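As background for the 252KB and 256KB figures that come up in the replies below, here is a stand-alone model (user-space C, not the kernel function) of the limit arithmetic as I read swiotlb_max_mapping_size(): the hard slot limit is IO_TLB_SIZE * IO_TLB_SEGSIZE, and the reported maximum subtracts the min_align_mask rounded up to a slot. The 2KiB/128 constants and the 4KiB NVMe min_align_mask are assumptions taken from the defaults discussed in this thread; roundup() here is a local helper.

#include <stdio.h>

#define IO_TLB_SIZE     2048UL          /* bytes per swiotlb slot           */
#define IO_TLB_SEGSIZE  128UL           /* max contiguous slots per mapping */

static unsigned long roundup(unsigned long val, unsigned long align)
{
        return ((val + align - 1) / align) * align;
}

int main(void)
{
        unsigned long min_align_mask = 0x1000 - 1;                      /* NVMe: 4KiB - 1 */
        unsigned long slot_limit = IO_TLB_SIZE * IO_TLB_SEGSIZE;        /* 256KiB */
        unsigned long max_mapping = slot_limit - roundup(min_align_mask, IO_TLB_SIZE);

        printf("hard slot limit    : %lu KiB\n", slot_limit / 1024);    /* 256 */
        printf("max mapping (NVMe) : %lu KiB\n", max_mapping / 1024);   /* 252 */
        return 0;
}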
* RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-21 11:35 ` [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device Will Deacon @ 2024-02-21 23:39 ` Michael Kelley 2024-02-23 19:58 ` Nicolin Chen 2024-02-26 19:35 ` Robin Murphy 2024-02-27 15:40 ` Christoph Hellwig 2 siblings, 1 reply; 27+ messages in thread From: Michael Kelley @ 2024-02-21 23:39 UTC (permalink / raw) To: Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). > On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that > an NVME device can map a size between 300KB~512KB, which certainly failed > the swiotlb mappings, though the default pool of swiotlb has many slots: > systemd[1]: Started Journal Service. > => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 > (slots), used 32 (slots) > note: journal-offline[392] exited with irqs disabled > note: journal-offline[392] exited with preempt_count 1 > > Call trace: > [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 > [ 3.099921] iommu_dma_map_page+0x218/0x328 > [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 > [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] > [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] > [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 > [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 > [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 > [ 3.103463] __submit_bio+0x44/0x220 > [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 > [ 3.103470] submit_bio_noacct+0x180/0x6c8 > [ 3.103471] submit_bio+0x34/0x130 > [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 > [ 3.104766] mpage_submit_folio+0xa0/0x100 > [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 > [ 3.104771] ext4_do_writepages+0x6a0/0xd78 > [ 3.105615] ext4_writepages+0x80/0x118 > [ 3.105616] do_writepages+0x90/0x1e8 > [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 > [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 > [ 3.106656] file_write_and_wait_range+0x84/0x120 > [ 3.106658] ext4_sync_file+0x7c/0x4c0 > [ 3.106660] vfs_fsync_range+0x3c/0xa8 > [ 3.106663] do_fsync+0x44/0xc0 > > Since untrusted devices might go down the swiotlb pathway with dma-iommu, > these devices should not map a size larger than swiotlb_max_mapping_size. > > To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to > take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from > the iommu_dma_opt_mapping_size(). 
> > Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > Link: https://lore.kernel.org/all/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com/ > Signed-off-by: Will Deacon <will@kernel.org> > --- > drivers/iommu/dma-iommu.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 50ccc4f1ef81..7d1a20da6d94 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -1706,6 +1706,13 @@ static size_t > iommu_dma_opt_mapping_size(void) > return iova_rcache_range(); > } > > +static size_t iommu_dma_max_mapping_size(struct device *dev) > +{ > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > + return swiotlb_max_mapping_size(dev); > + return SIZE_MAX; > +} > + In this [1] email, Nicolin had a version of this function that incorporated the IOMMU granule. For granules bigger than 4K, I think that's needed so that when IOMMU code sets the swiotlb alloc_align_mask to the IOMMU granule - 1, the larger offset plus the size won't exceed the max number of slots. swiotlb_max_mapping_size() by itself may return a value that's too big when alloc_align_mask is used. Michael [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48 > static const struct dma_map_ops iommu_dma_ops = { > .flags = DMA_F_PCI_P2PDMA_SUPPORTED, > .alloc = iommu_dma_alloc, > @@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = { > .unmap_resource = iommu_dma_unmap_resource, > .get_merge_boundary = iommu_dma_get_merge_boundary, > .opt_mapping_size = iommu_dma_opt_mapping_size, > + .max_mapping_size = iommu_dma_max_mapping_size, > }; > > /* > -- > 2.44.0.rc0.258.g7320e95886-goog ^ permalink raw reply [flat|nested] 27+ messages in thread
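For readers who don't want to chase the [1] link: a granule-aware variant along these lines might look roughly like the sketch below. This is not Nicolin's earlier version, just an illustration of the idea Michael describes -- cap the returned size at swiotlb_max_mapping_size() rounded down to the IOVA granule, so that the iova_align() round-up in iommu_dma_map_page() cannot push offset plus size past the slot limit. The iommu_get_dma_domain()/cookie->iovad plumbing and the use of ALIGN_DOWN() are assumptions based on how dma-iommu.c reaches the granule elsewhere.

/* Sketch only -- neither the patch under discussion nor Nicolin's version. */
static size_t iommu_dma_max_mapping_size(struct device *dev)
{
        struct iommu_domain *domain = iommu_get_dma_domain(dev);
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;

        if (!is_swiotlb_active(dev) || !dev_is_untrusted(dev))
                return SIZE_MAX;

        /*
         * Leave headroom for the granule round-up done by
         * iommu_dma_map_page(), so that the bounce buffer for
         * offset + aligned size still fits in the slot limit.
         */
        return ALIGN_DOWN(swiotlb_max_mapping_size(dev), iovad->granule);
}

For a 64KiB granule and a 252KiB swiotlb_max_mapping_size() this returns 192KiB, the sort of figure Nicolin weighs up in his reply below.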
* Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-21 23:39 ` Michael Kelley @ 2024-02-23 19:58 ` Nicolin Chen 2024-02-23 21:10 ` Michael Kelley 0 siblings, 1 reply; 27+ messages in thread From: Nicolin Chen @ 2024-02-23 19:58 UTC (permalink / raw) To: Will Deacon, Michael Kelley Cc: linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote: > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > +static size_t iommu_dma_max_mapping_size(struct device *dev) > > +{ > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > > + return swiotlb_max_mapping_size(dev); > > + return SIZE_MAX; > > +} > > + > > In this [1] email, Nicolin had a version of this function that incorporated > the IOMMU granule. For granules bigger than 4K, I think that's needed > so that when IOMMU code sets the swiotlb alloc_align_mask to the > IOMMU granule - 1, the larger offset plus the size won't exceed the > max number of slots. swiotlb_max_mapping_size() by itself may > return a value that's too big when alloc_align_mask is used. > > Michael > > [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48 Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet, the max size corresponding to the max number of slots should be 256KB. So, I feel that this is marginally safe? With that being said, there seems to be a 4KB size waste, due to aligning with the iommu_domain granule, in this particular alloc_size=256KB case? On the other hand, if swiotlb_max_mapping_size was subtracted by 64KB (granule=64KB), i.e. alloc_size=192KB, which seems to generate more swiotlb fragments? Thanks Nicolin ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-23 19:58 ` Nicolin Chen @ 2024-02-23 21:10 ` Michael Kelley 2024-02-25 21:17 ` Michael Kelley 0 siblings, 1 reply; 27+ messages in thread From: Michael Kelley @ 2024-02-23 21:10 UTC (permalink / raw) To: Nicolin Chen, Will Deacon Cc: linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui From: Nicolin Chen <nicolinc@nvidia.com> Sent: Friday, February 23, 2024 11:58 AM > > On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote: > > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 > 3:35 AM > > > +static size_t iommu_dma_max_mapping_size(struct device *dev) > > > +{ > > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > > > + return swiotlb_max_mapping_size(dev); > > > + return SIZE_MAX; > > > +} > > > + > > > > In this [1] email, Nicolin had a version of this function that incorporated > > the IOMMU granule. For granules bigger than 4K, I think that's needed > > so that when IOMMU code sets the swiotlb alloc_align_mask to the > > IOMMU granule - 1, the larger offset plus the size won't exceed the > > max number of slots. swiotlb_max_mapping_size() by itself may > > return a value that's too big when alloc_align_mask is used. > > > > Michael > > > > [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48 > > Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size > can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet, > the max size corresponding to the max number of slots should > be 256KB. So, I feel that this is marginally safe? Yes, in the specific case you tested. But I don't think the general case is safe. In your specific case, the "size" argument to iommu_dma_map_page() is presumably 252 Kbytes because that's what your new iommu_dma_max_mapping_size() returns. iommu_dma_map_page() calls iova_align() to round up the 252K to 256K. Also in your specific case, the "offset" argument to iommu_dma_map_page() is 0, so the "phys" variable (which embeds the offset) passed to swiotlb_tbl_map_single() is aligned on a 64 Kbyte page boundary. swiotlb_tbl_map_single() then computes an offset in the orig_addr (i.e., "phys") based on the DMA min_align_mask (4 Kbytes), and that value is 0 in your specific case. swiotlb_tbl_map_single() then calls swiotlb_find_slots() specifying a size that is 256K bytes plus an offset of 0, so everything works. But what if the original offset passed to iommu_dma_map_page() is 3 Kbytes (for example)? swiotlb_tbl_map_single() will have an orig_addr that is offset from a page boundary by 3 Kbytes. The value passed to swiotlb_find_slots() will be 256 Kbytes plus an offset of 3 Kbytes, which is too big. > > With that being said, there seems to be a 4KB size waste, due > to aligning with the iommu_domain granule, in this particular > alloc_size=256KB case? > > On the other hand, if swiotlb_max_mapping_size was subtracted > by 64KB (granule=64KB), i.e. alloc_size=192KB, which seems to > generate more swiotlb fragments? dma_max_mapping_size() returns a fixed value for a particular device, so the fixed value must be conversative enough to handle dma_map_page() calls with a non-zero offset (or the similar dma_map_sg() where the scatter/gather list has non-zero offsets). 
So yes, the higher layers must limit I/O size, which can produce more separate I/Os and swiotlb fragments to fulfill an original request that is 256 Kbytes or larger. The same thing happens with 4K page size in that a 256K I/O gets broken into a 252K I/O and a 4K I/O if the swiotlb is being used. I'm trying to think through if there's a way to make things work with iommu_max_mapping_size() being less conservative than subtracting the full granule size. I'm doubtful that there is. Michael ^ permalink raw reply [flat|nested] 27+ messages in thread
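The failure case described above can be checked with a few lines of arithmetic. The model below follows the sequence exactly as Michael lays it out (round up to the granule in iommu_dma_map_page(), then re-add the sub-4KiB offset in swiotlb_tbl_map_single()); the 3KiB offset is his hypothetical, and the program is a stand-alone illustration rather than kernel code.

#include <stdio.h>

int main(void)
{
        unsigned long granule        = 64 * 1024;
        unsigned long slot_limit     = 256 * 1024;      /* IO_TLB_SIZE * IO_TLB_SEGSIZE  */
        unsigned long map_size       = 252 * 1024;      /* capped by max_mapping_size    */
        unsigned long min_align_mask = 0x1000 - 1;      /* NVMe: 4KiB - 1                */
        unsigned long offset         = 3 * 1024;        /* Michael's hypothetical offset */

        /* iommu_dma_map_page(): iova_align() the size up to the granule. */
        unsigned long aligned = (map_size + granule - 1) & ~(granule - 1);
        /* swiotlb_tbl_map_single(): add back the offset below min_align_mask. */
        unsigned long needed = aligned + (offset & min_align_mask);

        printf("aligned %lu KiB + offset %lu KiB = %lu KiB (limit %lu KiB) -> %s\n",
               aligned / 1024, (offset & min_align_mask) / 1024, needed / 1024,
               slot_limit / 1024, needed > slot_limit ? "mapping fails" : "fits");
        return 0;
}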
* RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-23 21:10 ` Michael Kelley @ 2024-02-25 21:17 ` Michael Kelley 0 siblings, 0 replies; 27+ messages in thread From: Michael Kelley @ 2024-02-25 21:17 UTC (permalink / raw) To: Nicolin Chen, Will Deacon Cc: linux-kernel@vger.kernel.org, kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui From: Michael Kelley <mhklinux@outlook.com> Sent: Friday, February 23, 2024 1:11 PM > > From: Nicolin Chen <nicolinc@nvidia.com> Sent: Friday, February 23, 2024 11:58 AM > > > > On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote: > > > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 21, 2024 3:35 AM > > > > +static size_t iommu_dma_max_mapping_size(struct device *dev) > > > > +{ > > > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > > > > + return swiotlb_max_mapping_size(dev); > > > > + return SIZE_MAX; > > > > +} > > > > + > > > > > > In this [1] email, Nicolin had a version of this function that incorporated > > > the IOMMU granule. For granules bigger than 4K, I think that's needed > > > so that when IOMMU code sets the swiotlb alloc_align_mask to the > > > IOMMU granule - 1, the larger offset plus the size won't exceed the > > > max number of slots. swiotlb_max_mapping_size() by itself may > > > return a value that's too big when alloc_align_mask is used. > > > > > > Michael > > > > > > [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48 > > > > Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size > > can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet, > > the max size corresponding to the max number of slots should > > be 256KB. So, I feel that this is marginally safe? > > Yes, in the specific case you tested. But I don't think the general > case is safe. In your specific case, the "size" argument to > iommu_dma_map_page() is presumably 252 Kbytes because that's > what your new iommu_dma_max_mapping_size() returns. > iommu_dma_map_page() calls iova_align() to round up the 252K > to 256K. Also in your specific case, the "offset" argument to > iommu_dma_map_page() is 0, so the "phys" variable (which embeds > the offset) passed to swiotlb_tbl_map_single() is aligned on a > 64 Kbyte page boundary. swiotlb_tbl_map_single() then > computes an offset in the orig_addr (i.e., "phys") based on the > DMA min_align_mask (4 Kbytes), and that value is 0 in your specific > case. swiotlb_tbl_map_single() then calls swiotlb_find_slots() > specifying a size that is 256K bytes plus an offset of 0, so everything > works. > > But what if the original offset passed to iommu_dma_map_page() > is 3 Kbytes (for example)? swiotlb_tbl_map_single() will have an > orig_addr that is offset from a page boundary by 3 Kbytes. The > value passed to swiotlb_find_slots() will be 256 Kbytes plus an > offset of 3 Kbytes, which is too big. > > > > > With that being said, there seems to be a 4KB size waste, due > > to aligning with the iommu_domain granule, in this particular > > alloc_size=256KB case? > > > > On the other hand, if swiotlb_max_mapping_size was subtracted > > by 64KB (granule=64KB), i.e. alloc_size=192KB, which seems to > > generate more swiotlb fragments? 
> > dma_max_mapping_size() returns a fixed value for a particular > device, so the fixed value must be conversative enough to handle > dma_map_page() calls with a non-zero offset (or the similar > dma_map_sg() where the scatter/gather list has non-zero > offsets). So yes, the higher layers must limit I/O size, which > can produce more separate I/Os and swiotlb fragments to > fulfill an original request that is 256 Kbytes or larger. The > same thing happens with 4K page size in that a 256K I/O > gets broken into a 252K I/O and a 4K I/O if the swiotlb is > being used. > > I'm trying to think through if there's a way to make things > work with iommu_max_mapping_size() being less > conversative than subtracting the full granule size. I'm > doubtful that there is. > With the current interface between iommu_dma_map_page() and swiotlb_tbl_map_single(), iommu_max_mapping_size() really has no choice but to be more conservative and subtract the full granule size. The underlying problem is how iommu_dma_map_page() and swiotlb_tbl_map_single() interact. iommu_dma_map_page() rounds up the size to a granule boundary, and then swiotlb_tbl_map_single() adds the offset modulo DMA min_align_mask. These steps should be done in the opposite order -- first add the offset then do the rounding up. With that opposite order, iommu_max_mapping_size() could be less conservative and be based solely on min_align_mask. I could see using the following approach, where alignment and rounding up are both done in swiotlb_tbl_map_single(), based on the alloc_align_mask parameter. Conceptually it makes sense to align the start and the end of the allocated buffer in the same function rather than splitting them. 1. Remove the alloc_size parameter from swiotlb_tbl_map_single(). Fixup two other call sites, where alloc_size is redundant anyway. 2. swiotlb_tbl_map_single() calculates its own local alloc_size as map_size + offset, rounded up based on alloc_align_mask. If alloc_align_mask is zero, this rounding up is a no-op, and alloc_size equals map_size + offset. Pass this local alloc_size to swiotlb_find_slots. 3. iommu_dma_map_page() removes the iova_align() call just prior to invoking swiotlb_tbl_map_single(). Looking at the code, there are also problems in iommu_dma_map_page() in zero'ing out portions of the allocated swiotlb buffer. Evidently the zero'ing is done to avoid giving an untrusted device DMA access to kernel memory areas that could contain random information that's sensitive. But I see these issues in the code: 1. The zero'ing doesn't cover any "padding" at the beginning of the swiotlb buffer due to min_align_mask offsetting. The cover letter for the last set of changes in this area explicitly says that such padding isn't zero'ed [1] but I can't reconcile why. The swiotlb *does* allocate padding slots at the beginning, even in the 5.16 kernel when those commits first went in. Maybe I'm missing something. 2. With commit 901c7280ca0d, swiotlb_tbl_map_single() always calls swiotlb_bounce(), which initializes the mapped portion regardless of the DMA xfer direction. But then in the case of DMA_FROM_DEVICE (i.e., read I/Os) iommu_dma_map_page() will zero what swiotlb_bounce() already initialized. The discussion that led to commit 901c7280ca0d concluded that the buffer must be initialized to the ultimate target buffer contents so that partial reads don't change data that wasn't read. This zero'ing violates that and should be removed (unless there's some unique handling of untrusted devices I'm not aware of). 3. 
If the swiotlb path is taken for reasons other than an untrusted device, iommu_dma_map_page() still does the zero'ing. That doesn't break anything, but it seems wasteful as there could be a lot of padding area that is unnecessarily zero'ed in the dma_kmalloc_needs_bounce() case, especially if the IOMMU granule is 64K. Net, it seems like some additional work beyond Patch 5 of Will's series is needed to get Nicolin's path working robustly. I could help with coding, but I can't test IOMMU paths as I only have access to VMs (x64 and arm64). Michael [1] https://lore.kernel.org/all/20210929023300.335969-1-stevensd@google.com/ ^ permalink raw reply [flat|nested] 27+ messages in thread
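Running the same numbers with the ordering proposed above (add the offset first, then round up once inside swiotlb_tbl_map_single()) shows why it would let the cap be less conservative: 252KiB plus a 3KiB offset rounds up to 256KiB and still fits, so iommu_dma_max_mapping_size() would only need to account for min_align_mask rather than a whole granule. Again a stand-alone model with a local align_up() helper, not a patch.

#include <stdio.h>

static unsigned long align_up(unsigned long val, unsigned long align)
{
        return (val + align - 1) & ~(align - 1);
}

int main(void)
{
        unsigned long granule = 64 * 1024, limit = 256 * 1024;
        unsigned long map_size = 252 * 1024, offset = 3 * 1024;

        /* Today: round up to the granule first, add the offset afterwards. */
        unsigned long current_order  = align_up(map_size, granule) + offset;
        /* Proposal: add the offset first, then round up the total once.    */
        unsigned long proposed_order = align_up(map_size + offset, granule);

        printf("current ordering : %lu KiB (%s)\n", current_order / 1024,
               current_order > limit ? "exceeds the slot limit" : "fits");
        printf("proposed ordering: %lu KiB (%s)\n", proposed_order / 1024,
               proposed_order > limit ? "exceeds the slot limit" : "fits");
        return 0;
}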
* Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-21 11:35 ` [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device Will Deacon 2024-02-21 23:39 ` Michael Kelley @ 2024-02-26 19:35 ` Robin Murphy 2024-02-26 21:11 ` Michael Kelley 2024-02-27 15:40 ` Christoph Hellwig 2 siblings, 1 reply; 27+ messages in thread From: Robin Murphy @ 2024-02-26 19:35 UTC (permalink / raw) To: Will Deacon, linux-kernel Cc: kernel-team, iommu, Christoph Hellwig, Marek Szyprowski, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley On 21/02/2024 11:35 am, Will Deacon wrote: > From: Nicolin Chen <nicolinc@nvidia.com> > > The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). > On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that > an NVME device can map a size between 300KB~512KB, which certainly failed > the swiotlb mappings, though the default pool of swiotlb has many slots: > systemd[1]: Started Journal Service. > => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots) > note: journal-offline[392] exited with irqs disabled > note: journal-offline[392] exited with preempt_count 1 > > Call trace: > [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 > [ 3.099921] iommu_dma_map_page+0x218/0x328 > [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 > [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] > [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] > [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 > [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 > [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 > [ 3.103463] __submit_bio+0x44/0x220 > [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 > [ 3.103470] submit_bio_noacct+0x180/0x6c8 > [ 3.103471] submit_bio+0x34/0x130 > [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 > [ 3.104766] mpage_submit_folio+0xa0/0x100 > [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 > [ 3.104771] ext4_do_writepages+0x6a0/0xd78 > [ 3.105615] ext4_writepages+0x80/0x118 > [ 3.105616] do_writepages+0x90/0x1e8 > [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 > [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 > [ 3.106656] file_write_and_wait_range+0x84/0x120 > [ 3.106658] ext4_sync_file+0x7c/0x4c0 > [ 3.106660] vfs_fsync_range+0x3c/0xa8 > [ 3.106663] do_fsync+0x44/0xc0 > > Since untrusted devices might go down the swiotlb pathway with dma-iommu, > these devices should not map a size larger than swiotlb_max_mapping_size. > > To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to > take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from > the iommu_dma_opt_mapping_size(). On the basis that this is at least far closer to correct than doing nothing, Acked-by: Robin Murphy <robin.murphy@arm.com> TBH I'm scared to think about theoretical correctness for all the interactions between the IOVA granule and min_align_mask, since just the SWIOTLB stuff is bad enough, even before you realise the ways that the IOVA allocation isn't necessarily right either. However I reckon as long as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a min_align_mask larger than a granule, then this should probably work well enough as-is. Cheers, Robin. 
> Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com > Signed-off-by: Will Deacon <will@kernel.org> > --- > drivers/iommu/dma-iommu.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 50ccc4f1ef81..7d1a20da6d94 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -1706,6 +1706,13 @@ static size_t iommu_dma_opt_mapping_size(void) > return iova_rcache_range(); > } > > +static size_t iommu_dma_max_mapping_size(struct device *dev) > +{ > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > + return swiotlb_max_mapping_size(dev); > + return SIZE_MAX; > +} > + > static const struct dma_map_ops iommu_dma_ops = { > .flags = DMA_F_PCI_P2PDMA_SUPPORTED, > .alloc = iommu_dma_alloc, > @@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = { > .unmap_resource = iommu_dma_unmap_resource, > .get_merge_boundary = iommu_dma_get_merge_boundary, > .opt_mapping_size = iommu_dma_opt_mapping_size, > + .max_mapping_size = iommu_dma_max_mapping_size, > }; > > /* ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-26 19:35 ` Robin Murphy @ 2024-02-26 21:11 ` Michael Kelley 2024-02-27 13:22 ` Robin Murphy 0 siblings, 1 reply; 27+ messages in thread From: Michael Kelley @ 2024-02-26 21:11 UTC (permalink / raw) To: Robin Murphy, Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Robin Murphy <robin.murphy@arm.com> Sent: Monday, February 26, 2024 11:36 AM > > On 21/02/2024 11:35 am, Will Deacon wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). > > On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that > > an NVME device can map a size between 300KB~512KB, which certainly failed > > the swiotlb mappings, though the default pool of swiotlb has many slots: > > systemd[1]: Started Journal Service. > > => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots) > > note: journal-offline[392] exited with irqs disabled > > note: journal-offline[392] exited with preempt_count 1 > > > > Call trace: > > [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 > > [ 3.099921] iommu_dma_map_page+0x218/0x328 > > [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 > > [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] > > [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] > > [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 > > [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 > > [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 > > [ 3.103463] __submit_bio+0x44/0x220 > > [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 > > [ 3.103470] submit_bio_noacct+0x180/0x6c8 > > [ 3.103471] submit_bio+0x34/0x130 > > [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 > > [ 3.104766] mpage_submit_folio+0xa0/0x100 > > [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 > > [ 3.104771] ext4_do_writepages+0x6a0/0xd78 > > [ 3.105615] ext4_writepages+0x80/0x118 > > [ 3.105616] do_writepages+0x90/0x1e8 > > [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 > > [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 > > [ 3.106656] file_write_and_wait_range+0x84/0x120 > > [ 3.106658] ext4_sync_file+0x7c/0x4c0 > > [ 3.106660] vfs_fsync_range+0x3c/0xa8 > > [ 3.106663] do_fsync+0x44/0xc0 > > > > Since untrusted devices might go down the swiotlb pathway with dma-iommu, > > these devices should not map a size larger than swiotlb_max_mapping_size. > > > > To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to > > take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from > > the iommu_dma_opt_mapping_size(). > > On the basis that this is at least far closer to correct than doing nothing, > > Acked-by: Robin Murphy <robin.murphy@arm.com> > > TBH I'm scared to think about theoretical correctness for all the > interactions between the IOVA granule and min_align_mask, since just the > SWIOTLB stuff is bad enough, even before you realise the ways that the > IOVA allocation isn't necessarily right either. However I reckon as long > as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a > min_align_mask larger than a granule, then this should probably work > well enough as-is. > I'm not convinced. The conditions you describe are reasonable and reflect upstream code today. But there can still be a failure due to attempting to allocate a "too large" swiotlb buffer. 
It happens with a granule of 64K and min_align_mask of 4K - 1 (the NVMe case) when the offset passed to iommu_dma_map_page() is non-zero modulo 4K. With this patch, the size passed into iommu_dma_map_page() is limited to 252K, but it gets rounded up to 256K. Then swiotlb_tbl_map_single() adds the offset modulo 4K. The result exceeds the 256K limit in swiotlb and the mapping fails. The case I describe is a reasonable case that happens in the real world. As is, this patch will work most of the time because for large xfers the offset is typically at least 4K aligned. But not always. Michael ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-26 21:11 ` Michael Kelley @ 2024-02-27 13:22 ` Robin Murphy 2024-02-27 14:30 ` Michael Kelley 0 siblings, 1 reply; 27+ messages in thread From: Robin Murphy @ 2024-02-27 13:22 UTC (permalink / raw) To: Michael Kelley, Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Petr Tesarik, Dexuan Cui, Nicolin Chen On 26/02/2024 9:11 pm, Michael Kelley wrote: > From: Robin Murphy <robin.murphy@arm.com> Sent: Monday, February 26, 2024 11:36 AM >> >> On 21/02/2024 11:35 am, Will Deacon wrote: >>> From: Nicolin Chen <nicolinc@nvidia.com> >>> >>> The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). >>> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that >>> an NVME device can map a size between 300KB~512KB, which certainly failed >>> the swiotlb mappings, though the default pool of swiotlb has many slots: >>> systemd[1]: Started Journal Service. >>> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots) >>> note: journal-offline[392] exited with irqs disabled >>> note: journal-offline[392] exited with preempt_count 1 >>> >>> Call trace: >>> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 >>> [ 3.099921] iommu_dma_map_page+0x218/0x328 >>> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 >>> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] >>> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] >>> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 >>> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 >>> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 >>> [ 3.103463] __submit_bio+0x44/0x220 >>> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 >>> [ 3.103470] submit_bio_noacct+0x180/0x6c8 >>> [ 3.103471] submit_bio+0x34/0x130 >>> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 >>> [ 3.104766] mpage_submit_folio+0xa0/0x100 >>> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 >>> [ 3.104771] ext4_do_writepages+0x6a0/0xd78 >>> [ 3.105615] ext4_writepages+0x80/0x118 >>> [ 3.105616] do_writepages+0x90/0x1e8 >>> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 >>> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 >>> [ 3.106656] file_write_and_wait_range+0x84/0x120 >>> [ 3.106658] ext4_sync_file+0x7c/0x4c0 >>> [ 3.106660] vfs_fsync_range+0x3c/0xa8 >>> [ 3.106663] do_fsync+0x44/0xc0 >>> >>> Since untrusted devices might go down the swiotlb pathway with dma-iommu, >>> these devices should not map a size larger than swiotlb_max_mapping_size. >>> >>> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to >>> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from >>> the iommu_dma_opt_mapping_size(). >> >> On the basis that this is at least far closer to correct than doing nothing, >> >> Acked-by: Robin Murphy <robin.murphy@arm.com> >> >> TBH I'm scared to think about theoretical correctness for all the >> interactions between the IOVA granule and min_align_mask, since just the >> SWIOTLB stuff is bad enough, even before you realise the ways that the >> IOVA allocation isn't necessarily right either. However I reckon as long >> as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a >> min_align_mask larger than a granule, then this should probably work >> well enough as-is. >> > > I'm not convinced. The conditions you describe are reasonable > and reflect upstream code today. 
But there can still be a failure > due to attempting to allocate a "too large" swiotlb buffer. It > happens with a granule of 64K and min_align_mask of 4K - 1 (the > NVMe case) when the offset passed to iommu_dma_map_page() > is non-zero modulo 4K. With this patch, the size passed into > iommu_dma_map_page() is limited to 252K, but it gets rounded > up to 256K. Then swiotlb_tbl_map_single() adds the offset > modulo 4K. The result exceeds the 256K limit in swiotlb and > the mapping fails. > > The case I describe is a reasonable case that happens in the real > world. As is, this patch will work most of the time because for > large xfers the offset is typically at least 4K aligned. But not always. Indeed that's what my "probably [...] well enough" meant to imply. I think there's proving to be sufficient complexity here that if it turns out we do manage to still hit real-world issues with the coarse approximation, that will be the point when any further effort would be better spent on finally tackling the thing that's always been on the to-do list, where we'd only bounce the unaligned start and/or end granules while mapping the rest of the buffer in place. Cheers, Robin. ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-27 13:22 ` Robin Murphy @ 2024-02-27 14:30 ` Michael Kelley 0 siblings, 0 replies; 27+ messages in thread From: Michael Kelley @ 2024-02-27 14:30 UTC (permalink / raw) To: Robin Murphy, Will Deacon, linux-kernel@vger.kernel.org Cc: kernel-team@android.com, iommu@lists.linux.dev, Christoph Hellwig, Marek Szyprowski, Petr Tesarik, Dexuan Cui, Nicolin Chen From: Robin Murphy <robin.murphy@arm.com> Sent: Tuesday, February 27, 2024 5:23 AM > > On 26/02/2024 9:11 pm, Michael Kelley wrote: > > From: Robin Murphy <robin.murphy@arm.com> Sent: Monday, February > 26, 2024 11:36 AM > >> > >> On 21/02/2024 11:35 am, Will Deacon wrote: > >>> From: Nicolin Chen <nicolinc@nvidia.com> > >>> > >>> The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). > >>> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that > >>> an NVME device can map a size between 300KB~512KB, which certainly failed > >>> the swiotlb mappings, though the default pool of swiotlb has many slots: > >>> systemd[1]: Started Journal Service. > >>> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots) > >>> note: journal-offline[392] exited with irqs disabled > >>> note: journal-offline[392] exited with preempt_count 1 > >>> > >>> Call trace: > >>> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 > >>> [ 3.099921] iommu_dma_map_page+0x218/0x328 > >>> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 > >>> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] > >>> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] > >>> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 > >>> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 > >>> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 > >>> [ 3.103463] __submit_bio+0x44/0x220 > >>> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 > >>> [ 3.103470] submit_bio_noacct+0x180/0x6c8 > >>> [ 3.103471] submit_bio+0x34/0x130 > >>> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 > >>> [ 3.104766] mpage_submit_folio+0xa0/0x100 > >>> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 > >>> [ 3.104771] ext4_do_writepages+0x6a0/0xd78 > >>> [ 3.105615] ext4_writepages+0x80/0x118 > >>> [ 3.105616] do_writepages+0x90/0x1e8 > >>> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 > >>> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 > >>> [ 3.106656] file_write_and_wait_range+0x84/0x120 > >>> [ 3.106658] ext4_sync_file+0x7c/0x4c0 > >>> [ 3.106660] vfs_fsync_range+0x3c/0xa8 > >>> [ 3.106663] do_fsync+0x44/0xc0 > >>> > >>> Since untrusted devices might go down the swiotlb pathway with dma-iommu, > >>> these devices should not map a size larger than swiotlb_max_mapping_size. > >>> > >>> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to > >>> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from > >>> the iommu_dma_opt_mapping_size(). > >> > >> On the basis that this is at least far closer to correct than doing nothing, > >> > >> Acked-by: Robin Murphy <robin.murphy@arm.com> > >> > >> TBH I'm scared to think about theoretical correctness for all the > >> interactions between the IOVA granule and min_align_mask, since just the > >> SWIOTLB stuff is bad enough, even before you realise the ways that the > >> IOVA allocation isn't necessarily right either. 
However I reckon as long > >> as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a > >> min_align_mask larger than a granule, then this should probably work > >> well enough as-is. > >> > > > > I'm not convinced. The conditions you describe are reasonable > > and reflect upstream code today. But there can still be a failure > > due to attempting to allocate a "too large" swiotlb buffer. It > > happens with a granule of 64K and min_align_mask of 4K - 1 (the > > NVMe case) when the offset passed to iommu_dma_map_page() > > is non-zero modulo 4K. With this patch, the size passed into > > iommu_dma_map_page() is limited to 252K, but it gets rounded > > up to 256K. Then swiotlb_tbl_map_single() adds the offset > > modulo 4K. The result exceeds the 256K limit in swiotlb and > > the mapping fails. > > > > The case I describe is a reasonable case that happens in the real > > world. As is, this patch will work most of the time because for > > large xfers the offset is typically at least 4K aligned. But not always. > > Indeed that's what my "probably [...] well enough" meant to imply. > > I think there's proving to be sufficient complexity here that if it > turns out we do manage to still hit real-world issues with the coarse > approximation, that will be the point when any further effort would be > better spent on finally tackling the thing that's always been on the > to-do list, where we'd only bounce the unaligned start and/or end > granules while mapping the rest of the buffer in place. > Fair enough. It's your call to make. :-) Michael ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-21 11:35 ` [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device Will Deacon 2024-02-21 23:39 ` Michael Kelley 2024-02-26 19:35 ` Robin Murphy @ 2024-02-27 15:40 ` Christoph Hellwig 2024-02-27 15:53 ` Robin Murphy 2 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2024-02-27 15:40 UTC (permalink / raw) To: Will Deacon Cc: linux-kernel, kernel-team, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley On Wed, Feb 21, 2024 at 11:35:04AM +0000, Will Deacon wrote: > +static size_t iommu_dma_max_mapping_size(struct device *dev) > +{ > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > + return swiotlb_max_mapping_size(dev); Curious: do we really need both checks here? If swiotlb is active for a device (for whatever reason), aren't we then always bound by the max size? If not please add a comment explaining it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-27 15:40 ` Christoph Hellwig @ 2024-02-27 15:53 ` Robin Murphy 2024-02-28 12:05 ` Will Deacon 0 siblings, 1 reply; 27+ messages in thread From: Robin Murphy @ 2024-02-27 15:53 UTC (permalink / raw) To: Christoph Hellwig, Will Deacon Cc: linux-kernel, kernel-team, iommu, Christoph Hellwig, Marek Szyprowski, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley On 27/02/2024 3:40 pm, Christoph Hellwig wrote: > On Wed, Feb 21, 2024 at 11:35:04AM +0000, Will Deacon wrote: >> +static size_t iommu_dma_max_mapping_size(struct device *dev) >> +{ >> + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) >> + return swiotlb_max_mapping_size(dev); > > Curious: do we really need both checks here? If swiotlb is active > for a device (for whatever reason), aren't we then always bound > by the max size? If not please add a comment explaining it. > Oh, good point - if we have an untrusted device but SWIOTLB isn't initialised for whatever reason, then it doesn't matter what max_mapping_size returns because iommu_dma_map_page() is going to bail out regardless. Thanks, Robin. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device 2024-02-27 15:53 ` Robin Murphy @ 2024-02-28 12:05 ` Will Deacon 0 siblings, 0 replies; 27+ messages in thread From: Will Deacon @ 2024-02-28 12:05 UTC (permalink / raw) To: Robin Murphy Cc: Christoph Hellwig, linux-kernel, kernel-team, iommu, Christoph Hellwig, Marek Szyprowski, Petr Tesarik, Dexuan Cui, Nicolin Chen, Michael Kelley On Tue, Feb 27, 2024 at 03:53:05PM +0000, Robin Murphy wrote: > On 27/02/2024 3:40 pm, Christoph Hellwig wrote: > > On Wed, Feb 21, 2024 at 11:35:04AM +0000, Will Deacon wrote: > > > +static size_t iommu_dma_max_mapping_size(struct device *dev) > > > +{ > > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev)) > > > + return swiotlb_max_mapping_size(dev); > > > > Curious: do we really need both checks here? If swiotlb is active > > for a device (for whatever reason), aren't we then always bound > > by the max size? If not please add a comment explaining it. > > > > Oh, good point - if we have an untrusted device but SWIOTLB isn't > initialised for whatever reason, then it doesn't matter what > max_mapping_size returns because iommu_dma_map_page() is going to bail out > regardless. Makes sense. Since this is all internal to the IOMMU DMA code, I can just drop the first part of the check. I'll get a v5 out shortly. Will ^ permalink raw reply [flat|nested] 27+ messages in thread
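For reference, dropping the first part of the check presumably leaves something like the sketch below; the actual v5 may differ, so treat this as an illustration of the simplification rather than the posted patch.

static size_t iommu_dma_max_mapping_size(struct device *dev)
{
        /* SWIOTLB being inactive is already handled by iommu_dma_map_page(). */
        if (dev_is_untrusted(dev))
                return swiotlb_max_mapping_size(dev);
        return SIZE_MAX;
}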
* Re: [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() 2024-02-21 11:34 [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Will Deacon ` (4 preceding siblings ...) 2024-02-21 11:35 ` [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device Will Deacon @ 2024-02-23 11:34 ` Nicolin Chen 2024-02-23 12:25 ` Will Deacon 5 siblings, 1 reply; 27+ messages in thread From: Nicolin Chen @ 2024-02-23 11:34 UTC (permalink / raw) To: Will Deacon Cc: linux-kernel, kernel-team, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Michael Kelley On Wed, Feb 21, 2024 at 11:34:59AM +0000, Will Deacon wrote: > Hi again, folks, > > This is version four of the patches which I previously posted at: > > v1: https://lore.kernel.org/r/20240126151956.10014-1-will@kernel.org > v2: https://lore.kernel.org/r/20240131122543.14791-1-will@kernel.org > v3: https://lore.kernel.org/r/20240205190127.20685-1-will@kernel.org > > Thanks to Petr for his Reviewed-by tag on the first three. > > Changes since v3 include: > > - Use umax() instead of max() to fix a build warning if the first > patch is applied to older kernels which warn on signedness > mismatches. > > - Add two new patches to the end of the series to resolve some > additional issues with NVME and 64KiB pages, reported by Nicolin. > I've added them to this series, as the first three patches make it > easier to fix this problem in the SWIOTLB code. > > - Add Reviewed-by tags from Petr > > Cheers, > > Will > > Cc: iommu@lists.linux.dev > Cc: Christoph Hellwig <hch@lst.de> > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Petr Tesarik <petr.tesarik1@huawei-partners.com> > Cc: Dexuan Cui <decui@microsoft.com> > Cc: Nicolin Chen <nicolinc@nvidia.com> This fixes the bug with NVME on arm64/SMMU when PAGE_SIZE=64KiB. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() 2024-02-23 11:34 ` [PATCH v4 0/5] Fix double allocation in swiotlb_alloc() Nicolin Chen @ 2024-02-23 12:25 ` Will Deacon 0 siblings, 0 replies; 27+ messages in thread From: Will Deacon @ 2024-02-23 12:25 UTC (permalink / raw) To: Nicolin Chen Cc: linux-kernel, kernel-team, iommu, Christoph Hellwig, Marek Szyprowski, Robin Murphy, Petr Tesarik, Dexuan Cui, Michael Kelley On Fri, Feb 23, 2024 at 03:34:56AM -0800, Nicolin Chen wrote: > On Wed, Feb 21, 2024 at 11:34:59AM +0000, Will Deacon wrote: > > This is version four of the patches which I previously posted at: > > > > v1: https://lore.kernel.org/r/20240126151956.10014-1-will@kernel.org > > v2: https://lore.kernel.org/r/20240131122543.14791-1-will@kernel.org > > v3: https://lore.kernel.org/r/20240205190127.20685-1-will@kernel.org > > > > Thanks to Petr for his Reviewed-by tag on the first three. > > > > Changes since v3 include: > > > > - Use umax() instead of max() to fix a build warning if the first > > patch is applied to older kernels which warn on signedness > > mismatches. > > > > - Add two new patches to the end of the series to resolve some > > additional issues with NVME and 64KiB pages, reported by Nicolin. > > I've added them to this series, as the first three patches make it > > easier to fix this problem in the SWIOTLB code. > > > > - Add Reviewed-by tags from Petr > > > > Cheers, > > > > Will > > > > Cc: iommu@lists.linux.dev > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > > Cc: Robin Murphy <robin.murphy@arm.com> > > Cc: Petr Tesarik <petr.tesarik1@huawei-partners.com> > > Cc: Dexuan Cui <decui@microsoft.com> > > Cc: Nicolin Chen <nicolinc@nvidia.com> > > This fixes the bug with NVME on arm64/SMMU when PAGE_SIZE=64KiB. > > Tested-by: Nicolin Chen <nicolinc@nvidia.com> Thanks, Nicolin! Please can you also respond to Michael's observation on your patch (5/5)? I didn't think we needed anything extra there, but since it's your patch I'd prefer to hear your opinion. https://lore.kernel.org/lkml/SN6PR02MB4157828120FB7D3408CEC991D4572@SN6PR02MB4157.namprd02.prod.outlook.com/ Cheers, Will ^ permalink raw reply [flat|nested] 27+ messages in thread