* Bug in add_dma_entry()'s debugging code @ 2023-11-27 16:02 Alan Stern 2023-11-27 16:07 ` Christoph Hellwig 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2023-11-27 16:02 UTC (permalink / raw) To: Hamza Mahfooz, Dan Williams, Christoph Hellwig Cc: Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list Among other things, add_dma_entry() in kernel/dma/debug.c prints an error message when it sees two overlapping FROM_DEVICE DMA mappings. The actual overlap detection is performed by a separate routine, active_cacheline_insert(). But the criterion this routine uses is wrong. All it looks for is mappings that start on the same cache line. However on architectures that have cache-coherent DMA (such as x86), touching the same cache line does not mean that two DMA mappings will interfere with each other. To truly overlap, they would have to touch the same _bytes_. The routine does not check for this, and consequently we get error messages about overlapping mappings when in fact there is no overlap. This bug has been reported in https://bugzilla.kernel.org/show_bug.cgi?id=215740 How should this be fixed? Since the check done in add_dma_entry() is completely invalid for x86 and similar architectures, should it simply be removed for them? Or should the check be enhanced to look for byte-granularity overlap? Alan Stern PS: As a separate issue, active_cacheline_insert() fails to detect overlap in situations where a mapping occupies more than one cache line. For example, if mapping A uses lines N and N+1 and mapping B uses line N+1, no overlap will be detected because the radix-tree keys for A and B will be different (N vs. N+1). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-27 16:02 Bug in add_dma_entry()'s debugging code Alan Stern @ 2023-11-27 16:07 ` Christoph Hellwig 2023-11-27 16:51 ` Alan Stern 0 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2023-11-27 16:07 UTC (permalink / raw) To: Alan Stern Cc: Hamza Mahfooz, Dan Williams, Christoph Hellwig, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list On Mon, Nov 27, 2023 at 11:02:20AM -0500, Alan Stern wrote: > All it looks for is mappings that start on the same cache line. However > on architectures that have cache-coherent DMA (such as x86), touching > the same cache line does not mean that two DMA mappings will interfere > with each other. To truly overlap, they would have to touch the same > _bytes_. But that is a special case that does not matter. Linux drivers need to be written in a portable way, and that means we have to care about platforms that are not DMA coherent. > How should this be fixed? Since the check done in add_dma_entry() is > completely invalid for x86 and similar architectures, should it simply > be removed for them? Or should the check be enhanced to look for > byte-granularity overlap? The check is anything but "completely invalid". It points out that you violate the Linux DMA API requirements. This might not have an effect on the particular platform you are currently running on, but it is still wrong. Note that there have been various mumblings about using nosnoop dma on x86, in which case you'll have the issue there as well. > PS: As a separate issue, active_cacheline_insert() fails to detect > overlap in situations where a mapping occupies more than one cache line. > For example, if mapping A uses lines N and N+1 and mapping B uses line > N+1, no overlap will be detected because the radix-tree keys for A and B > will be different (N vs. N+1). Fixes welcome. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-27 16:07 ` Christoph Hellwig @ 2023-11-27 16:51 ` Alan Stern 2023-11-28 13:37 ` Christoph Hellwig 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2023-11-27 16:51 UTC (permalink / raw) To: Christoph Hellwig Cc: Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list On Mon, Nov 27, 2023 at 05:07:59PM +0100, Christoph Hellwig wrote: > On Mon, Nov 27, 2023 at 11:02:20AM -0500, Alan Stern wrote: > > All it looks for is mappings that start on the same cache line. However > > on architectures that have cache-coherent DMA (such as x86), touching > > the same cache line does not mean that two DMA mappings will interfere > > with each other. To truly overlap, they would have to touch the same > > _bytes_. > > But that is a special case that does not matter. Linux drivers need > to be written in a portable way, and that means we have to care about > platforms that are not DMA coherent. The buffers in the bug report were allocated by kmalloc(). Doesn't kmalloc() guarantee that on architectures with non-cache-coherent DMA, allocations will be aligned on cache-line boundaries (or larger)? Isn't this what ARCH_DMA_MINALIGN and ARCH_KMALLOC_MINALIGN are supposed to take care of in include/linux/slab.h? > > How should this be fixed? Since the check done in add_dma_entry() is > > completely invalid for x86 and similar architectures, should it simply > > be removed for them? Or should the check be enhanced to look for > > byte-granularity overlap? > > The patch is every but "completely invalid". It points out that you > violate the Linux DMA API requirements. Since when does the DMA API require that mappings on x86 must be to separate cache lines? Is this documented anywhere? For that matter, Documentation/core-api/dma-api-howto.rst explicitly says: If you acquired your memory via the page allocator (i.e. 
__get_free_page*()) or the generic memory allocators (i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from that memory using the addresses returned from those routines. It also says: Architectures must ensure that kmalloc'ed buffer is DMA-safe. Drivers and subsystems depend on it. If an architecture isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in the CPU cache is identical to data in main memory), ARCH_DMA_MINALIGN must be set so that the memory allocator makes sure that kmalloc'ed buffer doesn't share a cache line with the others. See arch/arm/include/asm/cache.h as an example. It says nothing about avoiding more than one DMA operation at a time to prevent overlap. Is the documentation wrong? > This might not have an > effect on the particular platform you are currently running on, but it > is still wrong. Who decides what is right and what is wrong in this area? Clearly you have a different opinion from David S. Miller, Richard Henderson, and Jakub Jelinek (the authors of that documentation file). > Note that there have been various mumblings about > using nosnoop dma on x86, in which case you'll have the issue there > as well. Unless the people who implement nosnoop DMA also modify kmalloc() or ARCH_DMA_MINALIGN. I guess the real question boils down to where the responsibility should lie. Should kmalloc() guarantee that the memory regions it provides will always be suitable for DMA, with no overlap issues? Or should all the innumerable users of kmalloc() be responsible for jumping through hoops to request arch-specific minimum alignment for their DMA buffers? Alan Stern ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-27 16:51 ` Alan Stern @ 2023-11-28 13:37 ` Christoph Hellwig 2023-11-28 15:18 ` Alan Stern 0 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2023-11-28 13:37 UTC (permalink / raw) To: Alan Stern Cc: Christoph Hellwig, Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Catalin Marinas, Robin Murphy, Will Deacon On Mon, Nov 27, 2023 at 11:51:21AM -0500, Alan Stern wrote: > The buffers in the bug report were allocated by kmalloc(). Doesn't > kmalloc() guarantee that on architectures with non-cache-coherent DMA, > allocations will be aligned on cache-line boundaries (or larger)? Isn't > this what ARCH_DMA_MINALIGN and ARCH_KMALLOC_MINALIGN are supposed to > take care of in include/linux/slab.h? Oh. Yes, the variable alignment from kmalloc makes things complicated. > Architectures must ensure that kmalloc'ed buffer is > DMA-safe. Drivers and subsystems depend on it. If an architecture > isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in > the CPU cache is identical to data in main memory), > ARCH_DMA_MINALIGN must be set so that the memory allocator > makes sure that kmalloc'ed buffer doesn't share a cache line with > the others. See arch/arm/include/asm/cache.h as an example. > > It says nothing about avoiding more than one DMA operation at a time to > prevent overlap. Is the documentation wrong? The documentation is a bit lacking, unfortunately. Again, assuming you actually have non-coherent mappings, you'd easily break the fragile cache line ownership if you did. That doesn't apply to x86 as-is, but we really try to keep drivers portable. > > This might not have an > > effect on the particular platform you are currently running on, but it > > is still wrong. > > Who decides what is right and what is wrong in this area? Clearly you > have a different opinion from David S. 
Miller, Richard Henderson, and > Jakub Jelinek (the authors of that documentation file). I don't think this is about anyone's opinion. It's a fundamental property of how to manage caches in the face of non-coherent DMA. > > > Note that there have been various mumblings about > > using nosnoop dma on x86, in which case you'll have the issue there > > as well. > > Unless the people who implement nosnoop DMA also modify kmalloc() or > ARCH_DMA_MINALIGN. Or don't do it on kmalloc buffers. > I guess the real question boils down to where the responsibility should > lie. Should kmalloc() guarantee that the memory regions it provides > will always be suitable for DMA, with no overlap issues? Or should all > the innumerable users of kmalloc() be responsible for jumping through > hoops to request arch-specific minimum alignment for their DMA buffers? I'd actually go one step back: 1) for not cache coherent DMA you can't do overlapping operations inside a cache line 2) dma-debug is intended to find DMA API misuses, even if they don't have bad effects on your particular system 3) the fact that the kmalloc implementation returns differently aligned memory depending on the platform breaks how dma-debug works currently The logical conclusion from that would be that IFF dma-debug is enabled on any platform we need to set ARCH_DMA_MINALIGN to the cache line size. BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by moving to bounce buffering unaligned memory for non-coherent architectures, which makes this even more complicated. Right now I don't have a good idea how to actually deal with having the cache line sanity checks with that, but I'm Ccing some of the usual suspects if they have a better idea. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-28 13:37 ` Christoph Hellwig @ 2023-11-28 15:18 ` Alan Stern 2023-11-28 16:31 ` Robin Murphy 2023-11-28 17:44 ` Catalin Marinas 0 siblings, 2 replies; 16+ messages in thread From: Alan Stern @ 2023-11-28 15:18 UTC (permalink / raw) To: Christoph Hellwig Cc: Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Catalin Marinas, Robin Murphy, Will Deacon On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote: > I'd actually go one step back: > > 1) for not cache coherent DMA you can't do overlapping operations inside > a cache line Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA operations that touch the same cache line concurrently. (The word "overlapping" is a little ambiguous in this context.) (Right now dma-debug considers only DMA-IN operations. In theory this restriction should apply even when some of the concurrent operations are DMA-OUT, provided that at least one of them is DMA-IN. Minor point...) > 2) dma-debug is intended to find DMA API misuses, even if they don't > have bad effects on your particular system > 3) the fact that the kmalloc implementation returns differently aligned > memory depending on the platform breaks how dma-debug works currently Exactly. That's essentially what Bugzilla #215740 is about. > The logical conclusion from that would be that IFF dma-debug is enabled on > any platform we need to set ARCH_DMA_MINALIGN to the cache line size. (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not meant for production systems but rather for kernel testing, right? > BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by > moving to bounce buffering unaligned memory for non-coherent > architectures, What's the reason for this? To allow the minimum allocation size to be smaller than the cache line size? 
Does the savings in memory make up for the extra overhead of bounce buffering? Or is this just to allow people to be more careless about how they allocate their DMA buffers (which doesn't seem to make sense)? > which makes this even more complicated. Right now I > don't have a good idea how to actually deal with having the cachline > sanity checks with that, but I'm Ccing some of the usual suspects if > they have a better idea. I get the impression that you would really like to have two different versions of kmalloc() and friends: one for buffers that will be used in DMA (and hence require cache-line alignment) and one for buffers that won't be. Alan Stern ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-28 15:18 ` Alan Stern @ 2023-11-28 16:31 ` Robin Murphy 2023-11-28 16:34 ` Andy Shevchenko 2023-11-28 17:44 ` Catalin Marinas 1 sibling, 1 reply; 16+ messages in thread From: Robin Murphy @ 2023-11-28 16:31 UTC (permalink / raw) To: Alan Stern, Christoph Hellwig Cc: Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Catalin Marinas, Will Deacon On 28/11/2023 3:18 pm, Alan Stern wrote: > On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote: >> I'd actually go one step back: >> >> 1) for not cache coherent DMA you can't do overlapping operations inside >> a cache line > > Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA > operations that touch the same cache line concurrently. (The word > "overlapping" is a a little ambiguous in this context.) > > (Right now dma-debug considers only DMA-IN operations. In theory this > restriction should apply even when some of the concurrent operations are > DMA-OUT, provided that at least one of them is DMA-IN. Minor point...) > >> 2) dma-debug is intended to find DMA API misuses, even if they don't >> have bad effects on your particular system >> 3) the fact that the kmalloc implementation returns differently aligned >> memory depending on the platform breaks how dma-debug works currently > > Exactly. That's essentially what Bugzilla #215740 is about. > >> The logical confcusion from that would be that IFF dma-debug is enabled on >> any platform we need to set ARCH_DMA_MINALIGN to the cache line size. > > (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not > meant for production systems but rather for kernel testing, right? Yikes, I'd hope that distros are heeding the warning that the Kconfig calls out already. 
It's perhaps somewhat understated, as I'd describe the performance impact to large modern systems with high-bandwidth I/O as somewhere between "severe" and "crippling". >> BUT: we're actually reduzing our dependency on ARCH_DMA_MINALIGN by >> moving to bounce buffering unaligned memory for non-coherent >> architectures, > > What's the reason for this? To allow the minimum allocation size to be > smaller than the cache line size? Does the savings in memory make up > for the extra overhead of bounce buffering? Yes, on systems where non-coherent streaming DMA is expected to be rare, or at least not performance-critical, not having to allocate 128 bytes every time we want just 8 or so soon adds up. > Or is this just to allow people to be more careless about how they > allocate their DMA buffers (which doesn't seem to make sense)? > >> which makes this even more complicated. Right now I >> don't have a good idea how to actually deal with having the cachline >> sanity checks with that, but I'm Ccing some of the usual suspects if >> they have a better idea. > > I get the impression that you would really like to have two different > versions of kmalloc() and friends: one for buffers that will be used in > DMA (and hence require cache-line alignment) and one for buffers that > won't be. That approach was mooted, but still has potentially-fiddly corner cases like when the DMA buffer is a member of a larger struct, so we settled on the bounce-buffering solution as the most robust compromise. Thanks, Robin. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-28 16:31 ` Robin Murphy @ 2023-11-28 16:34 ` Andy Shevchenko 2023-11-28 16:54 ` Robin Murphy 0 siblings, 1 reply; 16+ messages in thread From: Andy Shevchenko @ 2023-11-28 16:34 UTC (permalink / raw) To: Robin Murphy Cc: Alan Stern, Christoph Hellwig, Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Catalin Marinas, Will Deacon On Tue, Nov 28, 2023 at 6:31 PM Robin Murphy <robin.murphy@arm.com> wrote: > On 28/11/2023 3:18 pm, Alan Stern wrote: > > On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote: ... > >> The logical conclusion from that would be that IFF dma-debug is enabled on > >> any platform we need to set ARCH_DMA_MINALIGN to the cache line size. > > > > (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not > > meant for production systems but rather for kernel testing, right? > > Yikes, I'd hope that distros are heeding the warning that the Kconfig > calls out already. It's perhaps somewhat understated, as I'd describe > the performance impact to large modern systems with high-bandwidth I/O > as somewhere between "severe" and "crippling". Independently of the distro configurations, the (false positive) messages will make it difficult to debug the actual issues; shouldn't we make our code robust in any case? -- With Best Regards, Andy Shevchenko ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-28 16:34 ` Andy Shevchenko @ 2023-11-28 16:54 ` Robin Murphy 0 siblings, 0 replies; 16+ messages in thread From: Robin Murphy @ 2023-11-28 16:54 UTC (permalink / raw) To: Andy Shevchenko Cc: Alan Stern, Christoph Hellwig, Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Catalin Marinas, Will Deacon On 28/11/2023 4:34 pm, Andy Shevchenko wrote: > On Tue, Nov 28, 2023 at 6:31 PM Robin Murphy <robin.murphy@arm.com> wrote: >> On 28/11/2023 3:18 pm, Alan Stern wrote: >>> On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote: > > ... > >>>> The logical confcusion from that would be that IFF dma-debug is enabled on >>>> any platform we need to set ARCH_DMA_MINALIGN to the cache line size. >>> >>> (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not >>> meant for production systems but rather for kernel testing, right? >> >> Yikes, I'd hope that distros are heeding the warning that the Kconfig >> calls out already. It's perhaps somewhat understated, as I'd describe >> the performance impact to large modern systems with high-bandwidth I/O >> as somewhere between "severe" and "crippling". > > Independently on the distros configurations the (false positive) > message(s) will make difficult to debug the actual things, shouldn't > we have our code robust in any case? Sure, I have no objection to making dma-debug more robust and useful for its intended purpose, I was just commenting on the fact that any potential change in behaviour from this should be of less concern to distros than the significant change in behaviour that enabling it *already* poses (i.e. globally serialising DMA operations and doing inherently slow stuff in what are normally expected to be fast paths). Thanks, Robin. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-28 15:18 ` Alan Stern 2023-11-28 16:31 ` Robin Murphy @ 2023-11-28 17:44 ` Catalin Marinas 2023-11-30 20:08 ` Ferry Toth 1 sibling, 1 reply; 16+ messages in thread From: Catalin Marinas @ 2023-11-28 17:44 UTC (permalink / raw) To: Alan Stern Cc: Christoph Hellwig, Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Robin Murphy, Will Deacon On Tue, Nov 28, 2023 at 10:18:19AM -0500, Alan Stern wrote: > On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote: > > I'd actually go one step back: > > > > 1) for not cache coherent DMA you can't do overlapping operations inside > > a cache line > > Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA > operations that touch the same cache line concurrently. (The word > "overlapping" is a a little ambiguous in this context.) The problem is worse. I'd say you should not perform even a single non-cache-coherent DMA (usually from-device or bidirectional) operation if the cache line is shared with anything else modifying it. It doesn't need to be another DMA operation. But that's more difficult to add to the DMA API debug code (maybe something like the bouncing logic in dma_kmalloc_needs_bounce()). > > The logical confcusion from that would be that IFF dma-debug is enabled on > > any platform we need to set ARCH_DMA_MINALIGN to the cache line size. 
Or just force the kmalloc() min align to cache_line_size(), something like: diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 4a658de44ee9..3ece20367636 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void) #ifdef ARCH_HAS_DMA_MINALIGN return ARCH_DMA_MINALIGN; #endif + if (IS_ENABLED(CONFIG_DMA_API_DEBUG)) + return cache_line_size(); return 1; } #endif diff --git a/mm/slab_common.c b/mm/slab_common.c index 8d431193c273..d0b21d6e9328 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void) unsigned int minalign = dma_get_cache_alignment(); if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) && - is_swiotlb_allocated()) + is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG)) minalign = ARCH_KMALLOC_MINALIGN; return max(minalign, arch_slab_minalign()); Also note that to_cacheline_number() in kernel/dma/debug.c only takes into account the L1_CACHE_SHIFT. On arm64 for example, cache_line_size() returns the maximum line of all the cache levels (and we've seen hardware where the L1 is 64-byte, L2 is 128). > > BUT: we're actually reduzing our dependency on ARCH_DMA_MINALIGN by > > moving to bounce buffering unaligned memory for non-coherent > > architectures, > > What's the reason for this? To allow the minimum allocation size to be > smaller than the cache line size? Does the savings in memory make up > for the extra overhead of bounce buffering? > > Or is this just to allow people to be more careless about how they > allocate their DMA buffers (which doesn't seem to make sense)? It's the former and it does make a difference with lots of small structure or string allocations. [...] 
> I get the impression that you would really like to have two different > versions of kmalloc() and friends: one for buffers that will be used in > DMA (and hence require cache-line alignment) and one for buffers that > won't be. We've been there for the past 2-3 years (and a few other options). It's hard to guess in a generic way because the allocation place may not necessarily know how the device is going to access that memory (PIO, DMA). The conclusion was that for those cases where the buffer is small, we just do a bounce. If it's performance critical, the driver can use a kmem_cache_create(SLAB_HWCACHE_ALIGN) and avoid the bouncing. -- Catalin ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code 2023-11-28 17:44 ` Catalin Marinas @ 2023-11-30 20:08 ` Ferry Toth 2023-12-01 11:08 ` Catalin Marinas 0 siblings, 1 reply; 16+ messages in thread From: Ferry Toth @ 2023-11-30 20:08 UTC (permalink / raw) To: Catalin Marinas, Alan Stern Cc: Christoph Hellwig, Hamza Mahfooz, Dan Williams, Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko, Thorsten Leemhuis, iommu, Kernel development list, USB mailing list, Robin Murphy, Will Deacon Hi, Op 28-11-2023 om 18:44 schreef Catalin Marinas: > On Tue, Nov 28, 2023 at 10:18:19AM -0500, Alan Stern wrote: >> On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote: >>> I'd actually go one step back: >>> >>> 1) for not cache coherent DMA you can't do overlapping operations inside >>> a cache line >> >> Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA >> operations that touch the same cache line concurrently. (The word >> "overlapping" is a a little ambiguous in this context.) > > The problem is worse. I'd say you should not perform even a single > non-cache-coherent DMA (usually from-device or bidirectional) operation > if the cache line is shared with anything else modifying it. It doesn't > need to be another DMA operation. But that's more difficult to add to > the DMA API debug code (maybe something like the bouncing logic in > dma_kmalloc_needs_bounce()). > >>> The logical confcusion from that would be that IFF dma-debug is enabled on >>> any platform we need to set ARCH_DMA_MINALIGN to the cache line size. 
> > Or just force the kmalloc() min align to cache_line_size(), something > like: > > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h > index 4a658de44ee9..3ece20367636 100644 > --- a/include/linux/dma-mapping.h > +++ b/include/linux/dma-mapping.h > @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void) > #ifdef ARCH_HAS_DMA_MINALIGN > return ARCH_DMA_MINALIGN; > #endif > + if (IS_ENABLED(CONFIG_DMA_API_DEBUG)) > + return cache_line_size(); > return 1; > } > #endif > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 8d431193c273..d0b21d6e9328 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void) > unsigned int minalign = dma_get_cache_alignment(); > > if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) && > - is_swiotlb_allocated()) > + is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG)) > minalign = ARCH_KMALLOC_MINALIGN; > > return max(minalign, arch_slab_minalign()); With above suggestion "force the kmalloc() min align to cache_line_size()" + Alan's debug code: root@yuna:~# journalctl -k | grep hub kernel: usbcore: registered new interface driver hub kernel: hub 1-0:1.0: USB hub found kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0 kernel: hub 1-0:1.0: 1 port detected kernel: hub 2-0:1.0: USB hub found kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00 kernel: hub 2-0:1.0: 1 port detected kernel: hub 1-1:1.0: USB hub found kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340 kernel: hub 1-1:1.0: 7 ports detected and the stack trace indeed goes away. IOW also the 2 root hub kmalloc() are now also aligned to the cache line size, even though these never triggered the stack trace. Strange: hub status is aligned far away from hub buffer, kmalloc mysteries. This still did not land for me: are we detecting a false alarm here as the 2 DMA operations can never happen on the same cache line on non-cache-coherent platforms? 
If so, shouldn't we fix up the dma debug code to not detect a false alarm? Instead of changing the alignment? Or, is this a bonafide warning (for non-cache-coherent platforms)? Then we should not silence it by force aligning it, but issue a WARN (on a cache coherent platform) that is more useful (i.e. here we have not an overlap but a shared cache line). On a non-cache coherent platform something stronger than a WARN might be appropriate? > Also note that to_cacheline_number() in kernel/dma/debug.c only takes > into account the L1_CACHE_SHIFT. On arm64 for example, cache_line_size() > returns the maximum line of all the cache levels (and we've seen > hardware where the L1 is 64-byte, L2 is 128). > >>> BUT: we're actually reduzing our dependency on ARCH_DMA_MINALIGN by >>> moving to bounce buffering unaligned memory for non-coherent >>> architectures, >> >> What's the reason for this? To allow the minimum allocation size to be >> smaller than the cache line size? Does the savings in memory make up >> for the extra overhead of bounce buffering? >> >> Or is this just to allow people to be more careless about how they >> allocate their DMA buffers (which doesn't seem to make sense)? > > It's the former and it does make a difference with lots of small > structure or string allocations. > > [...] >> I get the impression that you would really like to have two different >> versions of kmalloc() and friends: one for buffers that will be used in >> DMA (and hence require cache-line alignment) and one for buffers that >> won't be. > > We've been there for the past 2-3 years (and a few other options). It's > hard to guess in a generic way because the allocation place may not > necessarily know how the device is going to access that memory (PIO, > DMA). The conclusion was that for those cases where the buffer is small, > we just do a bounce. If it's performance critical, the driver can use a > kmem_cache_create(SLAB_HWCACHE_ALIGN) and avoid the bouncing. 
> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code
  2023-11-30 20:08 ` Ferry Toth
@ 2023-12-01 11:08 ` Catalin Marinas
  2023-12-01 12:17   ` Ferry Toth
  0 siblings, 1 reply; 16+ messages in thread
From: Catalin Marinas @ 2023-12-01 11:08 UTC (permalink / raw)
To: Ferry Toth
Cc: Alan Stern, Christoph Hellwig, Hamza Mahfooz, Dan Williams,
    Marek Szyprowski, Andrew, Ferry Toth, Andy Shevchenko,
    Thorsten Leemhuis, iommu, Kernel development list,
    USB mailing list, Robin Murphy, Will Deacon

On Thu, Nov 30, 2023 at 09:08:23PM +0100, Ferry Toth wrote:
> Op 28-11-2023 om 18:44 schreef Catalin Marinas:
> > Or just force the kmalloc() min align to cache_line_size(), something
> > like:
> >
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 4a658de44ee9..3ece20367636 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
> >  #ifdef ARCH_HAS_DMA_MINALIGN
> >  	return ARCH_DMA_MINALIGN;
> >  #endif
> > +	if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
> > +		return cache_line_size();
> >  	return 1;
> >  }
> >  #endif
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 8d431193c273..d0b21d6e9328 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
> >  	unsigned int minalign = dma_get_cache_alignment();
> >
> >  	if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
> > -	    is_swiotlb_allocated())
> > +	    is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
> >  		minalign = ARCH_KMALLOC_MINALIGN;
> >
> >  	return max(minalign, arch_slab_minalign());
>
> With the above suggestion "force the kmalloc() min align to
> cache_line_size()" + Alan's debug code:
>
> root@yuna:~# journalctl -k | grep hub
> kernel: usbcore: registered new interface driver hub
> kernel: hub 1-0:1.0: USB hub found
> kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0
> kernel: hub 1-0:1.0: 1 port detected
> kernel: hub 2-0:1.0: USB hub found
> kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00
> kernel: hub 2-0:1.0: 1 port detected
> kernel: hub 1-1:1.0: USB hub found
> kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340
> kernel: hub 1-1:1.0: 7 ports detected
>
> and the stack trace indeed goes away.
>
> IOW the two root hub kmalloc() buffers are now also aligned to the
> cache line size, even though these never triggered the stack trace.
> Strange: hub status is aligned far away from hub buffer, kmalloc
> mysteries.

They are 64 bytes apart in all cases other than the last one, where I
guess the status had to go to a different slab page.

> This still did not land for me: are we detecting a false alarm here
> because the two DMA operations can never happen on the same cache line
> on non-cache-coherent platforms? If so, shouldn't we fix up the DMA
> debug code to not detect a false alarm, instead of changing the
> alignment?

It's a false alarm indeed on this hardware since the DMA is
cache-coherent. I think Christoph mentioned earlier in this thread that
he'd like the DMA API debug code to report potential problems even if
the hardware it is running on is safe.

> Or is this a bona fide warning (for non-cache-coherent platforms)?
> Then we should not silence it by force-aligning, but issue a WARN (on
> a cache-coherent platform) that is more useful (i.e. here we have not
> an overlap but a shared cache line). On a non-cache-coherent platform
> something stronger than a WARN might be appropriate?

A non-cache-coherent platform would either have the kmalloc()
allocations aligned or it would bounce those small, unaligned buffers.
So it doesn't seem right to issue a warning on x86, where the kmalloc()
minimum alignment is 8, and not get the warning on a non-coherent
platform which forces the kmalloc() alignment.

If we consider the kmalloc() aspect already covered by bouncing or
forced alignment, the DMA API debug code can still detect other cache
line sharing situations.

-- 
Catalin

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Bug in add_dma_entry()'s debugging code
  2023-12-01 11:08 ` Catalin Marinas
@ 2023-12-01 12:17 ` Ferry Toth
  2023-12-01 16:21   ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: Ferry Toth @ 2023-12-01 12:17 UTC (permalink / raw)
To: Catalin Marinas, Ferry Toth
Cc: Alan Stern, Christoph Hellwig, Hamza Mahfooz, Dan Williams,
    Marek Szyprowski, Andrew, Andy Shevchenko, Thorsten Leemhuis,
    iommu, Kernel development list, USB mailing list, Robin Murphy,
    Will Deacon

Hi

On 01-12-2023 12:08, Catalin Marinas wrote:
> On Thu, Nov 30, 2023 at 09:08:23PM +0100, Ferry Toth wrote:
>> Op 28-11-2023 om 18:44 schreef Catalin Marinas:
>>> Or just force the kmalloc() min align to cache_line_size(), something
>>> like:
>>>
>>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>>> index 4a658de44ee9..3ece20367636 100644
>>> --- a/include/linux/dma-mapping.h
>>> +++ b/include/linux/dma-mapping.h
>>> @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
>>>  #ifdef ARCH_HAS_DMA_MINALIGN
>>>  	return ARCH_DMA_MINALIGN;
>>>  #endif
>>> +	if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
>>> +		return cache_line_size();
>>>  	return 1;
>>>  }
>>>  #endif
>>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>>> index 8d431193c273..d0b21d6e9328 100644
>>> --- a/mm/slab_common.c
>>> +++ b/mm/slab_common.c
>>> @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
>>>  	unsigned int minalign = dma_get_cache_alignment();
>>>
>>>  	if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
>>> -	    is_swiotlb_allocated())
>>> +	    is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
>>>  		minalign = ARCH_KMALLOC_MINALIGN;
>>>
>>>  	return max(minalign, arch_slab_minalign());
>>
>> With the above suggestion "force the kmalloc() min align to
>> cache_line_size()" + Alan's debug code:
>>
>> root@yuna:~# journalctl -k | grep hub
>> kernel: usbcore: registered new interface driver hub
>> kernel: hub 1-0:1.0: USB hub found
>> kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0
>> kernel: hub 1-0:1.0: 1 port detected
>> kernel: hub 2-0:1.0: USB hub found
>> kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00
>> kernel: hub 2-0:1.0: 1 port detected
>> kernel: hub 1-1:1.0: USB hub found
>> kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340
>> kernel: hub 1-1:1.0: 7 ports detected
>>
>> and the stack trace indeed goes away.
>>
>> IOW the two root hub kmalloc() buffers are now also aligned to the
>> cache line size, even though these never triggered the stack trace.
>> Strange: hub status is aligned far away from hub buffer, kmalloc
>> mysteries.
> They are 64 bytes apart in all cases other than the last one, where I
> guess the status had to go to a different slab page.
>
>> This still did not land for me: are we detecting a false alarm here
>> because the two DMA operations can never happen on the same cache
>> line on non-cache-coherent platforms? If so, shouldn't we fix up the
>> DMA debug code to not detect a false alarm, instead of changing the
>> alignment?
> It's a false alarm indeed on this hardware since the DMA is
> cache-coherent. I think Christoph mentioned earlier in this thread
> that he'd like the DMA API debug code to report potential problems
> even if the hardware it is running on is safe.

I agree with Christoph. So my question is "are we detecting a false
alarm", full stop, not merely "a false alarm on this hardware".

>> Or is this a bona fide warning (for non-cache-coherent platforms)?
>> Then we should not silence it by force-aligning, but issue a WARN (on
>> a cache-coherent platform) that is more useful (i.e. here we have not
>> an overlap but a shared cache line). On a non-cache-coherent platform
>> something stronger than a WARN might be appropriate?
> A non-cache-coherent platform would either have the kmalloc()
> allocations aligned or it would bounce those small, unaligned buffers.

It would? Or it always has?

> So it doesn't seem right to issue a warning on x86, where the kmalloc()
> minimum alignment is 8, and not get the warning on a non-coherent
> platform which forces the kmalloc() alignment.

If *all* non-coherent platforms implement either correct alignment or
bounce buffers, and on (coherent) x86 we get a WARN, then it is a false
alarm, right? That is exactly my question (because I have no idea which
platforms have non-coherent caches).

> If we consider the kmalloc() aspect already covered by bouncing or
> forced alignment, the DMA API debug code can still detect other cache
> line sharing situations.

Consider? Or know?

If we know, and we cannot easily prevent false WARNs on x86, it could
be sufficient to change the message to "possible cacheline sharing
detected" and add a line saying that DMA_API_DEBUG should be disabled
on production systems.

And while at it, we might be able to fix the missed real cacheline
sharing mentioned by Alan:

"As a separate issue, active_cacheline_insert() fails to detect overlap
in situations where a mapping occupies more than one cache line. For
example, if mapping A uses lines N and N+1 and mapping B uses line N+1,
no overlap will be detected because the radix-tree keys for A and B
will be different (N vs. N+1)."

^ permalink raw reply	[flat|nested] 16+ messages in thread
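The detection gap quoted above — a mapping spanning lines N and N+1 keyed only by N — can be modeled in a few lines of userspace code (a hypothetical Python sketch of the radix-tree keying, not the kernel's actual code; the 64-byte cacheline size and all helper names are assumptions):

```python
CACHELINE = 64  # assumed cacheline size, purely illustrative

def start_line(addr):
    """Cacheline index of a mapping's start address (the radix-tree key)."""
    return addr // CACHELINE

active = {}  # start cacheline -> mapping, mimicking the radix tree

def insert(name, addr, size):
    """Return True if no overlap is reported, mimicking keying by start line."""
    key = start_line(addr)
    if key in active:
        return False        # overlap reported
    active[key] = (name, addr, size)
    return True             # no overlap reported

# Mapping A covers lines 0 and 1; mapping B covers line 1 only.
assert insert("A", 0, 2 * CACHELINE)       # keyed under line 0
assert insert("B", CACHELINE, CACHELINE)   # keyed under line 1: real overlap, missed

# A line-granular check over every line a mapping touches would catch it:
def lines(addr, size):
    return set(range(start_line(addr), start_line(addr + size - 1) + 1))

assert lines(0, 2 * CACHELINE) & lines(CACHELINE, CACHELINE)  # overlap on line 1
```

Keying by the start line only is why mapping B sails through; tracking every line a mapping occupies (as `lines()` does) is one way the debug code could close the gap.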
* Re: Bug in add_dma_entry()'s debugging code
  2023-12-01 12:17 ` Ferry Toth
@ 2023-12-01 16:21 ` Alan Stern
  2023-12-01 17:42   ` Catalin Marinas
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2023-12-01 16:21 UTC (permalink / raw)
To: Ferry Toth
Cc: Catalin Marinas, Ferry Toth, Christoph Hellwig, Hamza Mahfooz,
    Dan Williams, Marek Szyprowski, Andrew, Andy Shevchenko,
    Thorsten Leemhuis, iommu, Kernel development list,
    USB mailing list, Robin Murphy, Will Deacon

On Fri, Dec 01, 2023 at 01:17:43PM +0100, Ferry Toth wrote:
> > A non-cache-coherent platform would either have the kmalloc()
> > allocations aligned or it would bounce those small, unaligned
> > buffers.
>
> It would? Or it always has?
>
> > So it doesn't seem right to issue a warning on x86, where the
> > kmalloc() minimum alignment is 8, and not get the warning on a
> > non-coherent platform which forces the kmalloc() alignment.
>
> If *all* non-coherent platforms implement either correct alignment or
> bounce buffers, and on (coherent) x86 we get a WARN, then it is a
> false alarm, right?
>
> That is exactly my question (because I have no idea which platforms
> have non-coherent caches).

Don't forget, not all DMA buffers are allocated by kmalloc(). A buffer
allocated by some other means might not be aligned properly and might
share a cache line with another buffer.

Or you might have a single data structure that was allocated by
kmalloc() and then create separate DMA mappings for two members of that
structure. If the two members are in the same cache line, that would be
an error.

Alan Stern

> > If we consider the kmalloc() aspect already covered by bouncing or
> > forced alignment, the DMA API debug code can still detect other
> > cache line sharing situations.
>
> Consider? Or know?
>
> If we know, and we cannot easily prevent false WARNs on x86, it could
> be sufficient to change the message to "possible cacheline sharing
> detected" and add a line saying that DMA_API_DEBUG should be disabled
> on production systems.
>
> And while at it, we might be able to fix the missed real cacheline
> sharing mentioned by Alan:
>
> "As a separate issue, active_cacheline_insert() fails to detect
> overlap in situations where a mapping occupies more than one cache
> line. For example, if mapping A uses lines N and N+1 and mapping B
> uses line N+1, no overlap will be detected because the radix-tree keys
> for A and B will be different (N vs. N+1)."

^ permalink raw reply	[flat|nested] 16+ messages in thread
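Alan's struct-member scenario — two separately mapped members of one kmalloc() allocation that never share a byte but do share a cache line — can be illustrated with a toy model (hypothetical sketch; the 64-byte cacheline and the member offsets are assumptions, not taken from any real driver):

```python
CACHELINE = 64  # assumed cacheline size, purely illustrative

def lines(addr, size):
    """Set of cacheline indices touched by the range [addr, addr + size)."""
    return set(range(addr // CACHELINE, (addr + size - 1) // CACHELINE + 1))

def bytes_overlap(a, b):
    """True if the (addr, size) ranges a and b share at least one byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

# Hypothetical structure: an 8-byte 'status' member at offset 0 and a
# 16-byte 'data' member at offset 8, each given its own DMA mapping.
status = (0, 8)
data = (8, 16)

assert not bytes_overlap(status, data)   # byte-disjoint, so no byte-level error...
assert lines(*status) & lines(*data)     # ...yet both sit in cacheline 0
```

This is the case where a byte-granularity check would stay silent while the mapping is still an error on non-coherent hardware: a writeback of the dirty line holding `status` can clobber DMA-written `data`, or vice versa.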
* Re: Bug in add_dma_entry()'s debugging code
  2023-12-01 16:21 ` Alan Stern
@ 2023-12-01 17:42 ` Catalin Marinas
  2023-12-05 18:28   ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: Catalin Marinas @ 2023-12-01 17:42 UTC (permalink / raw)
To: Alan Stern
Cc: Ferry Toth, Ferry Toth, Christoph Hellwig, Hamza Mahfooz,
    Dan Williams, Marek Szyprowski, Andrew, Andy Shevchenko,
    Thorsten Leemhuis, iommu, Kernel development list,
    USB mailing list, Robin Murphy, Will Deacon

Replying to both here.

On Fri, Dec 01, 2023 at 11:21:40AM -0500, Alan Stern wrote:
> On Fri, Dec 01, 2023 at 01:17:43PM +0100, Ferry Toth wrote:
> > > A non-cache-coherent platform would either have the kmalloc()
> > > allocations aligned or it would bounce those small, unaligned
> > > buffers.
> >
> > It would? Or it always has?

It depends on the configuration. arm64 does the bouncing because it
opted in to CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC.

> > > So it doesn't seem right to issue a warning on x86, where the
> > > kmalloc() minimum alignment is 8, and not get the warning on a
> > > non-coherent platform which forces the kmalloc() alignment.
> >
> > If *all* non-coherent platforms implement either correct alignment
> > or bounce buffers, and on (coherent) x86 we get a WARN, then it is a
> > false alarm, right?
> >
> > That is exactly my question (because I have no idea which platforms
> > have non-coherent caches).
>
> Don't forget, not all DMA buffers are allocated by kmalloc(). A buffer
> allocated by some other means might not be aligned properly and might
> share a cache line with another buffer.
>
> Or you might have a single data structure that was allocated by
> kmalloc() and then create separate DMA mappings for two members of
> that structure. If the two members are in the same cache line, that
> would be an error.

Indeed, so to be sure we don't trip over other false alarms, we'd also
need to force ARCH_DMA_MINALIGN to be at least a cache line in size.
That macro is used in some structures to force a static alignment of
various members that take DMA transfers. After that, anything reported
might actually be a potential issue, not a false alarm.

However, I wonder whether we'd actually hide some genuine problems.
Let's say x86 gets some DMA corruption when it tries to DMA 8 bytes
into two adjacent buffers because of some DMA buffer overflow, nothing
to do with cache lines. You enable the DMA API debugging to see if it
reports anything and it suddenly starts working because of the forced
minimum alignment.

-- 
Catalin

^ permalink raw reply	[flat|nested] 16+ messages in thread
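Catalin's concern — that forcing a larger alignment under debug can make a real overflow bug disappear — can be sketched with illustrative numbers (a hypothetical model of back-to-back allocation placement; this is not how the slab allocator actually lays out memory, and all names here are made up):

```python
def alloc_layout(sizes, align):
    """Start offsets for back-to-back allocations rounded up to 'align'."""
    offs, cur = [], 0
    for s in sizes:
        cur = (cur + align - 1) // align * align  # round up to alignment
        offs.append(cur)
        cur += s
    return offs

def overflow_corrupts_next(sizes, align, overflow=8):
    """Does an 'overflow'-byte overrun of the first buffer reach the second?"""
    a, b = alloc_layout(sizes, align)
    return a + sizes[0] + overflow > b

sizes = (8, 8)
assert overflow_corrupts_next(sizes, align=8)        # bug visible at minalign 8
assert not overflow_corrupts_next(sizes, align=64)   # hidden at cacheline align
```

With the default 8-byte minimum alignment the 8-byte overrun lands in the neighbouring buffer; once the debug option forces cacheline alignment the buffers move 64 bytes apart and the corruption — the very thing being debugged — silently stops reproducing.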
* Re: Bug in add_dma_entry()'s debugging code
  2023-12-01 17:42 ` Catalin Marinas
@ 2023-12-05 18:28 ` Alan Stern
  [not found]   ` <1e4df825-08fa-40cf-a565-9c0d285c9b73@gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2023-12-05 18:28 UTC (permalink / raw)
To: Catalin Marinas
Cc: Ferry Toth, Ferry Toth, Christoph Hellwig, Hamza Mahfooz,
    Dan Williams, Marek Szyprowski, Andrew, Andy Shevchenko,
    Thorsten Leemhuis, iommu, Kernel development list,
    USB mailing list, Robin Murphy, Will Deacon

On Fri, Dec 01, 2023 at 05:42:52PM +0000, Catalin Marinas wrote:
> Indeed, so to be sure we don't trip over other false alarms, we'd also
> need to force ARCH_DMA_MINALIGN to be at least a cache-line size.
> That's used in some structures to force a static alignment of various
> members that take DMA transfers. After that, anything reported might
> actually be a potential issue, not a false alarm.
>
> However, I wonder whether we'd actually hide some genuine problems.
> Let's say x86 gets some DMA corruption when it tries to DMA 8 bytes
> into two adjacent buffers because of some DMA buffer overflow, nothing
> to do with cache lines. You enable the DMA API debugging to see if it
> reports anything and it suddenly starts working because of the forced
> minimum alignment.

In the long run, it may turn out that trying to detect memory usage
patterns that could cause problems for architectures other than the one
currently running is not workable. Certainly it is a bad idea to have a
debugging infrastructure that changes the behavior of other parts of
the system -- particularly when those other parts may be the ones
you're trying to debug.

You may end up needing to make a choice here. Which evil is lesser?

Alan Stern

^ permalink raw reply	[flat|nested] 16+ messages in thread
[parent not found: <1e4df825-08fa-40cf-a565-9c0d285c9b73@gmail.com>]
* Re: Bug in add_dma_entry()'s debugging code
  [not found] ` <1e4df825-08fa-40cf-a565-9c0d285c9b73@gmail.com>
@ 2023-12-06 16:21 ` Alan Stern
  0 siblings, 0 replies; 16+ messages in thread
From: Alan Stern @ 2023-12-06 16:21 UTC (permalink / raw)
To: Ferry Toth
Cc: Catalin Marinas, Christoph Hellwig, Hamza Mahfooz, Dan Williams,
    Marek Szyprowski, Andrew, Andy Shevchenko, Thorsten Leemhuis,
    iommu, Kernel development list, USB mailing list, Robin Murphy,
    Will Deacon

On Wed, Dec 06, 2023 at 05:12:40PM +0100, Ferry Toth wrote:
> Hi
>
> On 05-12-2023 19:28, Alan Stern wrote:
> > In the long run, it may turn out that trying to detect memory usage
> > patterns that could cause problems for architectures other than the
> > one currently running is not workable. Certainly it is a bad idea to
> > have a
>
> Maybe. But while debugging code on your platform it is a good thing to
> get warnings for potential issues on other platforms.

Oh, absolutely! It's definitely a good thing. I'm just saying that
doing it may not be practical -- there may be too many false positives
(as in this bug report) and false negatives.

> In this case we (I) got misled by the warning message itself. That
> should be easy enough to improve.

I don't think so. Issuing incorrect warnings that should be ignored is
a very bad idea, no matter what the wording is. After seeing a few
messages like that, people learn to expect them -- and then they ignore
the valid warnings along with the incorrect ones.

Alan Stern

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2023-12-06 16:21 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-27 16:02 Bug in add_dma_entry()'s debugging code Alan Stern
2023-11-27 16:07 ` Christoph Hellwig
2023-11-27 16:51 ` Alan Stern
2023-11-28 13:37 ` Christoph Hellwig
2023-11-28 15:18 ` Alan Stern
2023-11-28 16:31 ` Robin Murphy
2023-11-28 16:34 ` Andy Shevchenko
2023-11-28 16:54 ` Robin Murphy
2023-11-28 17:44 ` Catalin Marinas
2023-11-30 20:08 ` Ferry Toth
2023-12-01 11:08 ` Catalin Marinas
2023-12-01 12:17 ` Ferry Toth
2023-12-01 16:21 ` Alan Stern
2023-12-01 17:42 ` Catalin Marinas
2023-12-05 18:28 ` Alan Stern
[not found] ` <1e4df825-08fa-40cf-a565-9c0d285c9b73@gmail.com>
2023-12-06 16:21 ` Alan Stern
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox