linux-mm.kvack.org archive mirror
* slub - extended kmalloc redzone and dma alignment
@ 2025-04-04  9:30 Vlastimil Babka
  2025-04-04 10:30 ` Harry Yoo
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2025-04-04  9:30 UTC (permalink / raw)
  To: Feng Tang, Peng Fan, Hyeonggon Yoo
  Cc: David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Petr Tesarik

Hi,

due to some off-list inquiry I have realized that since 946fa0dbf2d8
("mm/slub: extend redzone check to extra allocated kmalloc space than
requested")
we might be reporting false positives due to dma writing into the redzone.

It wasn't confirmed (yet) during the conversation but AFAICS it can be
happening. We have this ARCH_DMA_MINALIGN and kmalloc() will guarantee it,
but the redzone check doesn't take it into account. So if it's 8 bytes, I
think it's valid to do e.g. kmalloc(60), get a 64 byte object, do a dma
operation and it can write into the last 4 bytes. But the redzone check will
then report a buffer overflow.
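
For illustration, a hypothetical driver sequence that could hit this (names
are made up, ARCH_DMA_MINALIGN assumed to be 8, slub_debug redzoning on):

    buf = kmalloc(60, GFP_KERNEL);    /* 64-byte kmalloc cache, bytes 60..63
                                         poisoned as extended redzone */
    dma = dma_map_single(dev, buf, 60, DMA_FROM_DEVICE);
    start_rx(dev, dma, 60);           /* hypothetical; device rounds up to
                                         8-byte bus words and writes 64 bytes */
    dma_unmap_single(dev, dma, 60, DMA_FROM_DEVICE);
    kfree(buf);                       /* redzone check sees bytes 60..63
                                         modified and reports an overflow */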

Do you agree? If yes we should probably align the redzone with
ARCH_DMA_MINALIGN even if it means not reporting some non-dma buffer
overflows anymore.

Thanks,
Vlastimil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-04  9:30 slub - extended kmalloc redzone and dma alignment Vlastimil Babka
@ 2025-04-04 10:30 ` Harry Yoo
  2025-04-04 11:12   ` Petr Tesarik
  0 siblings, 1 reply; 24+ messages in thread
From: Harry Yoo @ 2025-04-04 10:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Feng Tang, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org, Petr Tesarik

On Fri, Apr 04, 2025 at 11:30:49AM +0200, Vlastimil Babka wrote:
> Hi,
> 
> due to some off-list inquiry I have realized that since 946fa0dbf2d8
> ("mm/slub: extend redzone check to extra allocated kmalloc space than
> requested")
> we might be reporting false positives due to dma writing into the redzone.
> 
> It wasn't confirmed (yet) during the conversation but AFAICS it can be
> happening. We have this ARCH_DMA_MINALIGN and kmalloc() will guarantee it,
> but the redzone check doesn't take it into account.

Sounds valid to me.

> So if it's 8 bytes, I think it's valid to do e.g. kmalloc(60), get a 64 byte
> object, do a dma operation and it can write into the last 4 bytes.
> But the redzone check will then report a buffer overflow.
> 
> Do you agree? If yes we should probably align the redzone with
> ARCH_DMA_MINALIGN even if it means not reporting some non-dma buffer
> overflows anymore.

Hmm, then to fix it, we need to do the extended redzone check only
starting from <obj addr> + align(req_size, ARCH_DMA_MINALIGN),
instead of <obj addr> + req_size?

For instance, if req_size is 33 bytes and it's aligned up to 40 bytes, a 64-byte
object will be allocated. Then the extended redzone check should only cover
the remaining 24 bytes.
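
Roughly like this at the check site, I suppose (just a sketch; get_orig_size()
is the SLUB-internal helper and the exact hook would need checking):

    unsigned int orig_size = get_orig_size(s, object);            /* 33 */
    unsigned int rz_start  = ALIGN(orig_size, ARCH_DMA_MINALIGN); /* 40 */
    /* treat only [object + rz_start, object + s->object_size) as redzone,
       i.e. the remaining 64 - 40 = 24 bytes */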
 
-- 
Cheers,
Harry (formerly known as Hyeonggon)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-04 10:30 ` Harry Yoo
@ 2025-04-04 11:12   ` Petr Tesarik
  2025-04-04 12:45     ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Tesarik @ 2025-04-04 11:12 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Feng Tang, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

On Fri, 4 Apr 2025 19:30:09 +0900
Harry Yoo <harry.yoo@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 11:30:49AM +0200, Vlastimil Babka wrote:
> > Hi,
> > 
> > due to some off-list inquiry I have realized that since 946fa0dbf2d8
> > ("mm/slub: extend redzone check to extra allocated kmalloc space than
> > requested")
> > we might be reporting false positives due to dma writing into the redzone.
> > 
> > It wasn't confirmed (yet) during the conversation but AFAICS it can be
> > happening. We have this ARCH_DMA_MINALIGN and kmalloc() will guarantee it,
> > but the redzone check doesn't take it into account.  
> 
> Sounds valid to me.

I'm not sure I understand your concerns.

Are you afraid that another device on the bus caches a copy of the
redzone before it was poisoned, so it overwrites the redzone with stale
data on a memory write operation? IMO that's buggy, because if a
bus-mastering device implements such cache, it is the device driver's
responsibility to flush it before starting a DMA transfer. FTR I'm not
aware of any such devices, except GPUs, but there's a whole lot to do
about CPU<->GPU coherency management, including device-specific ioctl's
to expose some gory details all the way down to userspace.

Or are you concerned about bus data word size? I would again argue that
allocating a DMA buffer with a size that is not a multiple of the
transfer size is a bug. IOW the driver must make sure the buffer size
is a multiple of 4 if it is used for 32-bit DMA transfers, or a
multiple of 8 if it is used for 64-bit DMA transfers.

Confused.

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-04 11:12   ` Petr Tesarik
@ 2025-04-04 12:45     ` Vlastimil Babka
  2025-04-04 13:53       ` Petr Tesarik
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2025-04-04 12:45 UTC (permalink / raw)
  To: Petr Tesarik, Harry Yoo
  Cc: Feng Tang, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org, Catalin Marinas

On 4/4/25 13:12, Petr Tesarik wrote:
> On Fri, 4 Apr 2025 19:30:09 +0900
> Harry Yoo <harry.yoo@oracle.com> wrote:
> 
>> On Fri, Apr 04, 2025 at 11:30:49AM +0200, Vlastimil Babka wrote:
>> > Hi,
>> > 
>> > due to some off-list inquiry I have realized that since 946fa0dbf2d8
>> > ("mm/slub: extend redzone check to extra allocated kmalloc space than
>> > requested")
>> > we might be reporting false positives due to dma writing into the redzone.
>> > 
>> > It wasn't confirmed (yet) during the conversation but AFAICS it can be
>> > happening. We have this ARCH_DMA_MINALIGN and kmalloc() will guarantee it,
>> > but the redzone check doesn't take it into account.  
>> 
>> Sounds valid to me.
> 
> I'm not sure I understand your concerns.

I'd be happy to be proven wrong and you're more familiar with DMA details
than me :)

> Are you afraid that another device on the bus caches a copy of the
> redzone before it was poisoned, so it overwrites the redzone with stale
> data on a memory write operation? IMO that's buggy, because if a
> bus-mastering device implements such cache, it is the device driver's
> responsibility to flush it before starting a DMA transfer. FTR I'm not
> aware of any such devices, except GPUs, but there's a whole lot to do
> about CPU<->GPU coherency management, including device-specific ioctl's
> to expose some gory details all the way down to userspace.

OK, guess not that.

> Or are you concerned about bus data word size? I would again argue that
> allocating a DMA buffer with a size that is not a multiple of the
> transfer size is a bug. IOW the driver must make sure the buffer size
> is a multiple of 4 if it is used for 32-bit DMA transfers, or a
> multiple of 8 if it is used for 64-bit DMA transfers.

Yeah, I think it's that, and I thought drivers don't need to care themselves
because ARCH_DMA_MINALIGN means the kmalloc() layer provides that guarantee
itself. I also remember this series (incidentally, the discussion was just
recently revived).

https://lore.kernel.org/all/20230612153201.554742-1-catalin.marinas@arm.com/

> Confused.
> 
> Petr T



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-04 12:45     ` Vlastimil Babka
@ 2025-04-04 13:53       ` Petr Tesarik
  2025-04-06 14:02         ` Feng Tang
  2025-04-07  7:45         ` Vlastimil Babka
  0 siblings, 2 replies; 24+ messages in thread
From: Petr Tesarik @ 2025-04-04 13:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Harry Yoo, Feng Tang, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org, Catalin Marinas

On Fri, 4 Apr 2025 14:45:14 +0200
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 4/4/25 13:12, Petr Tesarik wrote:
> > On Fri, 4 Apr 2025 19:30:09 +0900
> > Harry Yoo <harry.yoo@oracle.com> wrote:
> >   
> >> On Fri, Apr 04, 2025 at 11:30:49AM +0200, Vlastimil Babka wrote:  
> >> > Hi,
> >> > 
> >> > due to some off-list inquiry I have realized that since 946fa0dbf2d8
> >> > ("mm/slub: extend redzone check to extra allocated kmalloc space than
> >> > requested")
> >> > we might be reporting false positives due to dma writing into the redzone.
> >> > 
> >> > It wasn't confirmed (yet) during the conversation but AFAICS it can be
> >> > happening. We have this ARCH_DMA_MINALIGN and kmalloc() will guarantee it,
> >> > but the redzone check doesn't take it into account.    
> >> 
> >> Sounds valid to me.  
> > 
> > I'm not sure I understand your concerns.  
> 
> I'd be happy to be proven wrong and you're more familiar with DMA details
> than me :)
> 
> > Are you afraid that another device on the bus caches a copy of the
> > redzone before it was poisoned, so it overwrites the redzone with stale
> > data on a memory write operation? IMO that's buggy, because if a
> > bus-mastering device implements such cache, it is the device driver's
> > responsibility to flush it before starting a DMA transfer. FTR I'm not
> > aware of any such devices, except GPUs, but there's a whole lot to do
> > about CPU<->GPU coherency management, including device-specific ioctl's
> > to expose some gory details all the way down to userspace.  
> 
> OK, guess not that.
> 
> > Or are you concerned about bus data word size? I would again argue that
> > allocating a DMA buffer with a size that is not a multiple of the
> > transfer size is a bug. IOW the driver must make sure the buffer size
> > is a multiple of 4 if it is used for 32-bit DMA transfers, or a
> > multiple of 8 if it is used for 64-bit DMA transfers.  
> 
> Yeah I think it's that, and I thought drivers don't need to care themselves
> because ARCH_DMA_MINALIGN means kmalloc() layer provides that guarantee
> itself. I also remember this series (incidentally just recently the
> discussion was revived).
> 
> https://lore.kernel.org/all/20230612153201.554742-1-catalin.marinas@arm.com/

I can remember this series, as well as my confusion why 192-byte
kmalloc caches were missing on arm64.

Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
a DMA buffer on the same cache line as some other data that might be
_written_ by the CPU while the corresponding main memory is modified by
another bus-mastering device.

Consider this layout:

 ... | DMA buffer | other data | ...
     ^                         ^
     +-------------------------+-- cache line boundaries

When you prepare for DMA, you make sure that the DMA buffer is not
cached by the CPU, so you flush the cache line (from all levels). Then
you tell the device to write into the DMA buffer. However, before the
device finishes the DMA transaction, the CPU accesses "other data",
loading this cache line from main memory with partial results. Worse,
if the CPU writes to "other data", it may write the cache line back
into main memory, racing with the device writing to DMA buffer, and you
end up with corrupted data in DMA buffer.

But redzone poisoning should happen long before the DMA buffer cache
line is flushed. The device will not overwrite it unless it was given
wrong buffer length for the transaction, but then that would be a bug
that I'd rather detect.
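
In DMA API terms, the usual non-coherent FROM_DEVICE sequence is roughly
(sketch; the device kick-off call is hypothetical):

    buf = kmalloc(len, GFP_KERNEL);        /* redzone poisoned here */
    dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
                                           /* arch_sync_dma_for_device():
                                              cache maintenance, CPU cache no
                                              longer holds the buffer */
    start_transfer(dev, dma, len);         /* hypothetical device kick-off */
    /* ... wait for completion ... */
    dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
                                           /* arch_sync_dma_for_cpu():
                                              stale cache lines dropped */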

@Catalin: I can see you're already in Cc. If I'm still missing
something, please, correct me.

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-04 13:53       ` Petr Tesarik
@ 2025-04-06 14:02         ` Feng Tang
  2025-04-07  7:21           ` Feng Tang
  2025-04-07  7:45         ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Feng Tang @ 2025-04-06 14:02 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Vlastimil Babka, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Catalin Marinas

On Fri, Apr 04, 2025 at 03:53:03PM +0200, Petr Tesarik wrote:
> On Fri, 4 Apr 2025 14:45:14 +0200
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> > On 4/4/25 13:12, Petr Tesarik wrote:
> > > On Fri, 4 Apr 2025 19:30:09 +0900
> > > Harry Yoo <harry.yoo@oracle.com> wrote:
> > >   
> > >> On Fri, Apr 04, 2025 at 11:30:49AM +0200, Vlastimil Babka wrote:  
> > >> > Hi,
> > >> > 
> > >> > due to some off-list inquiry I have realized that since 946fa0dbf2d8
> > >> > ("mm/slub: extend redzone check to extra allocated kmalloc space than
> > >> > requested")
> > >> > we might be reporting false positives due to dma writing into the redzone.
> > >> > 
> > >> > It wasn't confirmed (yet) during the conversation but AFAICS it can be
> > >> > happening. We have this ARCH_DMA_MINALIGN and kmalloc() will guarantee it,
> > >> > but the redzone check doesn't take it into account.    
> > >> 
> > >> Sounds valid to me.  
> > > 
> > > I'm not sure I understand your concerns.  
> > 
> > I'd be happy to be proven wrong and you're more familiar with DMA details
> > than me :)
> > 
> > > Are you afraid that another device on the bus caches a copy of the
> > > redzone before it was poisoned, so it overwrites the redzone with stale
> > > data on a memory write operation? IMO that's buggy, because if a
> > > bus-mastering device implements such cache, it is the device driver's
> > > responsibility to flush it before starting a DMA transfer. FTR I'm not
> > > aware of any such devices, except GPUs, but there's a whole lot to do
> > > about CPU<->GPU coherency management, including device-specific ioctl's
> > > to expose some gory details all the way down to userspace.  
> > 
> > OK, guess not that.
> > 
> > > Or are you concerned about bus data word size? I would again argue that
> > > allocating a DMA buffer with a size that is not a multiple of the
> > > transfer size is a bug. IOW the driver must make sure the buffer size
> > > is a multiple of 4 if it is used for 32-bit DMA transfers, or a
> > > multiple of 8 if it is used for 64-bit DMA transfers.  
> > 
> > Yeah I think it's that, and I thought drivers don't need to care themselves
> > because ARCH_DMA_MINALIGN means kmalloc() layer provides that guarantee
> > itself. I also remember this series (incidentally just recently the
> > discussion was revived).
> > 
> > https://lore.kernel.org/all/20230612153201.554742-1-catalin.marinas@arm.com/
> 
> I can remember this series, as well as my confusion why 192-byte
> kmalloc caches were missing on arm64.
> 
> Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
> a DMA buffer on the same cache line as some other data that might be
> _written_ by the CPU while the corresponding main memory is modified by
> another bus-mastering device.
> 
> Consider this layout:
> 
>  ... | DMA buffer | other data | ...
>      ^                         ^
>      +-------------------------+-- cache line boundaries
> 
> When you prepare for DMA, you make sure that the DMA buffer is not
> cached by the CPU, so you flush the cache line (from all levels). Then
> you tell the device to write into the DMA buffer. However, before the
> device finishes the DMA transaction, the CPU accesses "other data",
> loading this cache line from main memory with partial results. Worse,
> if the CPU writes to "other data", it may write the cache line back
> into main memory, racing with the device writing to DMA buffer, and you
> end up with corrupted data in DMA buffer.
> 
> But redzone poisoning should happen long before the DMA buffer cache
> line is flushed. The device will not overwrite it unless it was given
> wrong buffer length for the transaction, but then that would be a bug
> that I'd rather detect.

I also tend to think it's better for slub to detect this kind of DMA
'overflow'. We added a slub kunit test case for it in commit
6cd6d33ca41f ("mm/slub, kunit: Add a test case for kmalloc redzone check"),
which was inspired by a similar DMA-related bug described in
commit 120ee599b5bf ("staging: octeon-usb: prevent memory corruption").
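
The spirit of that test, roughly (sketch only, not the exact lib/slub_kunit.c
code):

    u8 *p = kmalloc(18, GFP_KERNEL);  /* 32-byte bucket, bytes 18..31 redzoned */
    p[18] = 'x';                      /* write just past the requested size */
    kfree(p);                         /* with redzone checking enabled, the
                                         object is reported as corrupted */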

Thanks,
Feng

> 
> @Catalin: I can see you're already in Cc. If I'm still missing
> something, please, correct me.
> 
> Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-06 14:02         ` Feng Tang
@ 2025-04-07  7:21           ` Feng Tang
  2025-04-07  7:54             ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2025-04-07  7:21 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Vlastimil Babka, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Catalin Marinas

On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
[...]
> > I can remember this series, as well as my confusion why 192-byte
> > kmalloc caches were missing on arm64.
> > 
> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
> > a DMA buffer on the same cache line as some other data that might be
> > _written_ by the CPU while the corresponding main memory is modified by
> > another bus-mastering device.
> > 
> > Consider this layout:
> > 
> >  ... | DMA buffer | other data | ...
> >      ^                         ^
> >      +-------------------------+-- cache line boundaries
> > 
> > When you prepare for DMA, you make sure that the DMA buffer is not
> > cached by the CPU, so you flush the cache line (from all levels). Then
> > you tell the device to write into the DMA buffer. However, before the
> > device finishes the DMA transaction, the CPU accesses "other data",
> > loading this cache line from main memory with partial results. Worse,
> > if the CPU writes to "other data", it may write the cache line back
> > into main memory, racing with the device writing to DMA buffer, and you
> > end up with corrupted data in DMA buffer.
> > 
> > But redzone poisoning should happen long before the DMA buffer cache
> > line is flushed. The device will not overwrite it unless it was given
> > wrong buffer length for the transaction, but then that would be a bug
> > that I'd rather detect.
> 
> I alaso tend to think it's better for slub to detect these kind of DMA
> 'overflow'. We've added slub kunit test case for these in commmit 
> 6cd6d33ca41f ("mm/slub, kunit: Add a test case for kmalloc redzone check),
> which was inspired by a similar DMA related bug as described in
> commit 120ee599b5bf ("staging: octeon-usb: prevent memory corruption")

I'm not familiar with DMA stuff, but Vlastimil's idea does make it
easier for a driver developer to write a driver to be used on different
ARCHs, which have different DMA alignment requirements. Say the minimal
safe size is 8 bytes: the driver can just request 8 bytes and
ARCH_DMA_MINALIGN will automatically choose the right size for it, which
can save memory on ARCHs with a smaller alignment requirement. Meanwhile
it does sacrifice part of the redzone check ability, so I don't have a
preference here :)

Thanks,
Feng


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-04 13:53       ` Petr Tesarik
  2025-04-06 14:02         ` Feng Tang
@ 2025-04-07  7:45         ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2025-04-07  7:45 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Harry Yoo, Feng Tang, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org, Catalin Marinas

On 4/4/25 15:53, Petr Tesarik wrote:
> On Fri, 4 Apr 2025 14:45:14 +0200
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> 
>> https://lore.kernel.org/all/20230612153201.554742-1-catalin.marinas@arm.com/
> 
> I can remember this series, as well as my confusion why 192-byte
> kmalloc caches were missing on arm64.
> 
> Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
> a DMA buffer on the same cache line as some other data that might be
> _written_ by the CPU while the corresponding main memory is modified by
> another bus-mastering device.
> 
> Consider this layout:
> 
>  ... | DMA buffer | other data | ...
>      ^                         ^
>      +-------------------------+-- cache line boundaries
> 
> When you prepare for DMA, you make sure that the DMA buffer is not
> cached by the CPU, so you flush the cache line (from all levels). Then
> you tell the device to write into the DMA buffer. However, before the
> device finishes the DMA transaction, the CPU accesses "other data",
> loading this cache line from main memory with partial results. Worse,

I'd expect loading has to be safe, somehow. Otherwise, even without accessing
the cache line directly (either the buffer or the other-data part), you may be
accessing nearby cache lines, triggering a prefetch into the CPU cache. Is
this handled by another flush after the DMA operation is done? In any case
it doesn't affect the redzoning.

> if the CPU writes to "other data", it may write the cache line back
> into main memory, racing with the device writing to DMA buffer, and you
> end up with corrupted data in DMA buffer.
> 
> But redzone poisoning should happen long before the DMA buffer cache
> line is flushed. The device will not overwrite it unless it was given
> wrong buffer length for the transaction, but then that would be a bug
> that I'd rather detect.

Thanks for the explanation, it makes sense to me and thus the extended
redzoning should be safe, unless someone disputes your view. Sorry for the
noise!

> @Catalin: I can see you're already in Cc. If I'm still missing
> something, please, correct me.
> 
> Petr T



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-07  7:21           ` Feng Tang
@ 2025-04-07  7:54             ` Vlastimil Babka
  2025-04-07  9:50               ` Petr Tesarik
  2025-04-07 17:12               ` Catalin Marinas
  0 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2025-04-07  7:54 UTC (permalink / raw)
  To: Feng Tang, Petr Tesarik
  Cc: Harry Yoo, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org, Catalin Marinas

On 4/7/25 09:21, Feng Tang wrote:
> On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> [...]
>> > I can remember this series, as well as my confusion why 192-byte
>> > kmalloc caches were missing on arm64.
>> > 
>> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
>> > a DMA buffer on the same cache line as some other data that might be
>> > _written_ by the CPU while the corresponding main memory is modified by
>> > another bus-mastering device.
>> > 
>> > Consider this layout:
>> > 
>> >  ... | DMA buffer | other data | ...
>> >      ^                         ^
>> >      +-------------------------+-- cache line boundaries
>> > 
>> > When you prepare for DMA, you make sure that the DMA buffer is not
>> > cached by the CPU, so you flush the cache line (from all levels). Then
>> > you tell the device to write into the DMA buffer. However, before the
>> > device finishes the DMA transaction, the CPU accesses "other data",
>> > loading this cache line from main memory with partial results. Worse,
>> > if the CPU writes to "other data", it may write the cache line back
>> > into main memory, racing with the device writing to DMA buffer, and you
>> > end up with corrupted data in DMA buffer.
>> > 
>> > But redzone poisoning should happen long before the DMA buffer cache
>> > line is flushed. The device will not overwrite it unless it was given
>> > wrong buffer length for the transaction, but then that would be a bug
>> > that I'd rather detect.
>> 
>> I alaso tend to think it's better for slub to detect these kind of DMA
>> 'overflow'. We've added slub kunit test case for these in commmit 
>> 6cd6d33ca41f ("mm/slub, kunit: Add a test case for kmalloc redzone check),
>> which was inspired by a similar DMA related bug as described in
>> commit 120ee599b5bf ("staging: octeon-usb: prevent memory corruption")

OK, so besides Petr's explanation, which was about cache (in)coherency and is
AFAIK tied to ARCH_DMA_MINALIGN, there is the possibility of DMA that will
really write garbage beyond a buffer that's not word aligned. Can we
assume that this was really a bug in the usage, and that ensuring word alignment
(not ARCH_DMA_MINALIGN alignment) is required from a different layer than
kmalloc() itself? In that case it would be best to keep the reporting as it is.

> I'm not familiar with DMA stuff, but Vlastimil's idea does make it
> easier for driver developer to write a driver to be used on different 
> ARCHs, which have different DMA alignment requirement. Say if the minimal
> safe size is 8 bytes, the driver can just request 8 bytes and
> ARCH_DMA_MINALIGN will automatically chose the right size for it, which
> can save memory for ARCHs with smaller alignment requirement. Meanwhile
> it does sacrifice part of the redzone check ability, so I don't have
> preference here :)

Let's clarify first who's expected to ensure the word alignment for DMA; if
it's not kmalloc(), then I'd rather resist moving it there :)

Thanks,
Vlastimil

> Thanks,
> Feng



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-07  7:54             ` Vlastimil Babka
@ 2025-04-07  9:50               ` Petr Tesarik
  2025-04-07 17:12               ` Catalin Marinas
  1 sibling, 0 replies; 24+ messages in thread
From: Petr Tesarik @ 2025-04-07  9:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org, Catalin Marinas

On Mon, 7 Apr 2025 09:54:41 +0200
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 4/7/25 09:21, Feng Tang wrote:
> > On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> > [...]  
> >> > I can remember this series, as well as my confusion why 192-byte
> >> > kmalloc caches were missing on arm64.
> >> > 
> >> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
> >> > a DMA buffer on the same cache line as some other data that might be
> >> > _written_ by the CPU while the corresponding main memory is modified by
> >> > another bus-mastering device.
> >> > 
> >> > Consider this layout:
> >> > 
> >> >  ... | DMA buffer | other data | ...
> >> >      ^                         ^
> >> >      +-------------------------+-- cache line boundaries
> >> > 
> >> > When you prepare for DMA, you make sure that the DMA buffer is not
> >> > cached by the CPU, so you flush the cache line (from all levels). Then
> >> > you tell the device to write into the DMA buffer. However, before the
> >> > device finishes the DMA transaction, the CPU accesses "other data",
> >> > loading this cache line from main memory with partial results. Worse,
> >> > if the CPU writes to "other data", it may write the cache line back
> >> > into main memory, racing with the device writing to DMA buffer, and you
> >> > end up with corrupted data in DMA buffer.
> >> > 
> >> > But redzone poisoning should happen long before the DMA buffer cache
> >> > line is flushed. The device will not overwrite it unless it was given
> >> > wrong buffer length for the transaction, but then that would be a bug
> >> > that I'd rather detect.  
> >> 
> >> I alaso tend to think it's better for slub to detect these kind of DMA
> >> 'overflow'. We've added slub kunit test case for these in commmit 
> >> 6cd6d33ca41f ("mm/slub, kunit: Add a test case for kmalloc redzone check),
> >> which was inspired by a similar DMA related bug as described in
> >> commit 120ee599b5bf ("staging: octeon-usb: prevent memory corruption")  
> 
> OK so besides Petr's explanation that was about cache (in)coherency and is
> AFAIK tied to ARCH_DMA_MINALIGN, there is possibility of DMA that will
> really write garbage beyond the buffer that's not word aligned. Can we
> assume that this was really a bug in the usage and ensuring word alignment
> (not ARCH_DMA_MINALIGN alignment) is required from a different layer than
> kmalloc() itself? In that case it would be best to keep the reporting as it is.
>
> > I'm not familiar with DMA stuff, but Vlastimil's idea does make it
> > easier for driver developer to write a driver to be used on different 
> > ARCHs, which have different DMA alignment requirement. Say if the minimal
> > safe size is 8 bytes, the driver can just request 8 bytes and
> > ARCH_DMA_MINALIGN will automatically chose the right size for it, which
> > can save memory for ARCHs with smaller alignment requirement. Meanwhile
> > it does sacrifice part of the redzone check ability, so I don't have
> > preference here :)  
> 
> Let's clarify first who's expected to ensure the word alignment for DMA, if
> it's not kmalloc() then I'd rather resist moving it there :)

I think it's clear. The granularity is mandated by the bus protocol.
It's not like the wires can add padding bits to a transaction by means
of magic. ;-) If a device becomes the bus master, a specific chip
drives the bus, and this chip must know about word size. In fact, the
hardware registers on that chip usually specify the transfer size in
units of bus words, not in bytes. And these hardware registers are
programmed by the corresponding device driver, so it must be the device
driver's job to ensure proper alignment.
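
E.g. a driver typically ends up programming something like this (register
names invented for illustration):

    writel(lower_32_bits(dma_addr), regs + RX_DMA_ADDR_LO);
    writel(upper_32_bits(dma_addr), regs + RX_DMA_ADDR_HI);
    writel(DIV_ROUND_UP(buf_len, 8), regs + RX_DMA_NWORDS); /* length in 64-bit
                                                                bus words: the
                                                                device writes
                                                                whole words only */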

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-07  7:54             ` Vlastimil Babka
  2025-04-07  9:50               ` Petr Tesarik
@ 2025-04-07 17:12               ` Catalin Marinas
  2025-04-08  5:27                 ` Petr Tesarik
  1 sibling, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2025-04-07 17:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Feng Tang, Petr Tesarik, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

Hi Vlastimil, Feng,

Thanks for looping me in. I'm just catching up with this thread.

On Mon, Apr 07, 2025 at 09:54:41AM +0200, Vlastimil Babka wrote:
> On 4/7/25 09:21, Feng Tang wrote:
> > On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> > [...]
> >> > I can remember this series, as well as my confusion why 192-byte
> >> > kmalloc caches were missing on arm64.
> >> > 
> >> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid putting
> >> > a DMA buffer on the same cache line as some other data that might be
> >> > _written_ by the CPU while the corresponding main memory is modified by
> >> > another bus-mastering device.
> >> > 
> >> > Consider this layout:
> >> > 
> >> >  ... | DMA buffer | other data | ...
> >> >      ^                         ^
> >> >      +-------------------------+-- cache line boundaries
> >> > 
> >> > When you prepare for DMA, you make sure that the DMA buffer is not
> >> > cached by the CPU, so you flush the cache line (from all levels). Then
> >> > you tell the device to write into the DMA buffer. However, before the
> >> > device finishes the DMA transaction, the CPU accesses "other data",
> >> > loading this cache line from main memory with partial results. Worse,
> >> > if the CPU writes to "other data", it may write the cache line back
> >> > into main memory, racing with the device writing to DMA buffer, and you
> >> > end up with corrupted data in DMA buffer.

Yes, cache evictions from 'other data' can overwrite the DMA. Another
problem: when the DMA has completed, the kernel does a cache invalidation to
remove any speculatively loaded cache lines from the DMA buffer, but that
would also invalidate 'other data', potentially corrupting it if it was
dirty.

So it's not safe to have DMA into buffers smaller than ARCH_DMA_MINALIGN
(and unaligned). What I did when reducing the minimum kmalloc()
alignment was to force bouncing via swiotlb if the size passed to the
DMA API is small. It may end up bouncing buffers that did not originate
from kmalloc() or that have proper alignment (with padding), but that's a
heuristic we were willing to accept to be able to use small kmalloc()
caches on arm64 - see dma_kmalloc_needs_bounce().
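
The heuristic is roughly this (paraphrased from memory; the real
dma_kmalloc_needs_bounce()/dma_kmalloc_size_aligned() are in
include/linux/dma-map-ops.h):

    static bool needs_bounce(struct device *dev, size_t size,
                             enum dma_data_direction dir)
    {
        if (dev_is_dma_coherent(dev) || dir == DMA_TO_DEVICE)
            return false;   /* coherent, or CPU->device only: no risk */
        /* small buffers may share a cache line with other data -> bounce */
        return size < 2 * ARCH_DMA_MINALIGN &&
               !IS_ALIGNED(kmalloc_size_roundup(size), dma_get_cache_alignment());
    }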

Does redzoning apply to kmalloc() or kmem_cache_create() (or both)? I
haven't checked yet but if the red zone is within ARCH_DMA_MINALIGN (or
rather dma_get_cache_alignment()), we could have issues with either
corrupting the DMA buffer or the red zone. Note that this only applies
to DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.

> >> > But redzone poisoning should happen long before the DMA buffer cache
> >> > line is flushed. The device will not overwrite it unless it was given
> >> > wrong buffer length for the transaction, but then that would be a bug
> >> > that I'd rather detect.
> >> 
> >> I alaso tend to think it's better for slub to detect these kind of DMA
> >> 'overflow'. We've added slub kunit test case for these in commmit 
> >> 6cd6d33ca41f ("mm/slub, kunit: Add a test case for kmalloc redzone check),
> >> which was inspired by a similar DMA related bug as described in
> >> commit 120ee599b5bf ("staging: octeon-usb: prevent memory corruption")
> 
> OK so besides Petr's explanation that was about cache (in)coherency and is
> AFAIK tied to ARCH_DMA_MINALIGN, there is possibility of DMA that will
> really write garbage beyond the buffer that's not word aligned. Can we
> assume that this was really a bug in the usage and ensuring word alignment
> (not ARCH_DMA_MINALIGN alignment) is required from a different layer than
> kmalloc() itself? In that case it would be best to keep the reporting as it is.

dma_direct_map_page() for example bounces the DMA if it detects a small
size, just in case it came from kmalloc(). Similarly in the iommu code -
dev_use_sg_swiotlb().

> > I'm not familiar with DMA stuff, but Vlastimil's idea does make it
> > easier for driver developer to write a driver to be used on different 
> > ARCHs, which have different DMA alignment requirement. Say if the minimal
> > safe size is 8 bytes, the driver can just request 8 bytes and
> > ARCH_DMA_MINALIGN will automatically chose the right size for it, which
> > can save memory for ARCHs with smaller alignment requirement. Meanwhile
> > it does sacrifice part of the redzone check ability, so I don't have
> > preference here :)
> 
> Let's clarify first who's expected to ensure the word alignment for DMA, if
> it's not kmalloc() then I'd rather resist moving it there :)

In theory, the DMA API should handle the alignment as I tried to remove
it from the kmalloc() code.

With kmem_cache_create() (or kmalloc() as well), if the object size is
not cacheline-aligned, is there a risk of redzoning around the object
without any alignment restrictions? The logic in
dma_kmalloc_size_aligned() would fail for sufficiently large buffers but
with an unaligned red zone around the object. Not sure how to fix this
in the DMA API, though. At least for kmem_cache_create() one can pass
SLAB_HWCACHE_ALIGN and it will force the alignment. I need to check what
redzoning does in this case (won't have time until tomorrow, I just got
back from holiday and have lots of emails to go through).
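
I.e. something along these lines (cache name made up):

    struct kmem_cache *c = kmem_cache_create("some_dma_cache", 104, 0,
                                             SLAB_HWCACHE_ALIGN, NULL);
    /* forces object alignment to the cache line size; whether the red zone
       placement is then safe for DMA still needs checking, as said above */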

-- 
Catalin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-07 17:12               ` Catalin Marinas
@ 2025-04-08  5:27                 ` Petr Tesarik
  2025-04-08 15:07                   ` Catalin Marinas
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Tesarik @ 2025-04-08  5:27 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

On Mon, 7 Apr 2025 18:12:09 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> Hi Vlastimil, Feng,
> 
> Thanks for looping me in. I'm just catching up with this thread.
> 
> On Mon, Apr 07, 2025 at 09:54:41AM +0200, Vlastimil Babka wrote:
> > On 4/7/25 09:21, Feng Tang wrote:  
> > > On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> > > [...]  
> > >> > I can remember this series, as well as my confusion why
> > >> > 192-byte kmalloc caches were missing on arm64.
> > >> > 
> > >> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid
> > >> > putting a DMA buffer on the same cache line as some other data
> > >> > that might be _written_ by the CPU while the corresponding
> > >> > main memory is modified by another bus-mastering device.
> > >> > 
> > >> > Consider this layout:
> > >> > 
> > >> >  ... | DMA buffer | other data | ...
> > >> >      ^                         ^
> > >> >      +-------------------------+-- cache line boundaries
> > >> > 
> > >> > When you prepare for DMA, you make sure that the DMA buffer is
> > >> > not cached by the CPU, so you flush the cache line (from all
> > >> > levels). Then you tell the device to write into the DMA
> > >> > buffer. However, before the device finishes the DMA
> > >> > transaction, the CPU accesses "other data", loading this cache
> > >> > line from main memory with partial results. Worse, if the CPU
> > >> > writes to "other data", it may write the cache line back into
> > >> > main memory, racing with the device writing to DMA buffer, and
> > >> > you end up with corrupted data in DMA buffer.  
> 
> Yes, cache evictions from 'other data; can override the DMA. Another
> problem, when the DMA completed, the kernel does a cache invalidation
> to remove any speculatively loaded cache lines from the DMA buffer
> but that would also invalidate 'other data', potentially corrupting
> it if it was dirty.
> 
> So it's not safe to have DMA into buffers less than ARCH_DMA_MINALIGN
> (and unaligned).

It's not safe to DMA into buffers that share a CPU cache line with other
data, which could be before or after the DMA buffer, of course.

> What I did with reducing the minimum kmalloc()
> alignment was to force bouncing via swiotlb if the size passed to the
> DMA API is small. It may end up bouncing buffers that did not
> originate from kmalloc() or have proper alignment (with padding) but
> that's some heuristics we were willing to accept to be able to use
> small kmalloc() caches on arm64 - see dma_kmalloc_needs_bounce().
> 
> Does redzoning apply to kmalloc() or kmem_cache_create() (or both)? I
> haven't checked yet but if the red zone is within ARCH_DMA_MINALIGN
> (or rather dma_get_cache_alignment()), we could have issues with
> either corrupting the DMA buffer or the red zone. [...]

I'm sorry if I'm being thick, but IIUC the red zone does not have to be
protected. Yes, we might miss red zone corruption if it happens to race
with a DMA transaction, but I have assumed that this is permissible. I
regard the red zone as a useful debugging tool, not a safety measure
that is guaranteed to detect any write beyond the buffer end.

>[...]
> > > I'm not familiar with DMA stuff, but Vlastimil's idea does make it
> > > easier for driver developer to write a driver to be used on
> > > different ARCHs, which have different DMA alignment requirement.
> > > Say if the minimal safe size is 8 bytes, the driver can just
> > > request 8 bytes and ARCH_DMA_MINALIGN will automatically chose
> > > the right size for it, which can save memory for ARCHs with
> > > smaller alignment requirement. Meanwhile it does sacrifice part
> > > of the redzone check ability, so I don't have preference here :)  
> > 
> > Let's clarify first who's expected to ensure the word alignment for
> > DMA, if it's not kmalloc() then I'd rather resist moving it there
> > :)  
> 
> In theory, the DMA API should handle the alignment as I tried to
> remove it from the kmalloc() code.

Are we talking about the alignment of the starting address, of the buffer
size, or both?

> With kmem_cache_create() (or kmalloc() as well), if the object size is
> not cacheline-aligned, is there risk of redzoning around the object
> without any alignment restrictions? The logic in
> dma_kmalloc_size_aligned() would fail for sufficiently large buffers
> but with unaligned red zone around the object.

This red zone is the extra memory that is normally wasted by kmalloc()
rounding up the requested size to the bucket size.
But dma_kmalloc_size_aligned() already uses kmalloc_size_roundup(size),
so it seems to be covered.

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-08  5:27                 ` Petr Tesarik
@ 2025-04-08 15:07                   ` Catalin Marinas
  2025-04-09  8:39                     ` Petr Tesarik
  2025-04-09  8:51                     ` Vlastimil Babka
  0 siblings, 2 replies; 24+ messages in thread
From: Catalin Marinas @ 2025-04-08 15:07 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

On Tue, Apr 08, 2025 at 07:27:32AM +0200, Petr Tesarik wrote:
> On Mon, 7 Apr 2025 18:12:09 +0100
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Thanks for looping me in. I'm just catching up with this thread.
> > 
> > On Mon, Apr 07, 2025 at 09:54:41AM +0200, Vlastimil Babka wrote:
> > > On 4/7/25 09:21, Feng Tang wrote:  
> > > > On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> > > > [...]  
> > > >> > I can remember this series, as well as my confusion why
> > > >> > 192-byte kmalloc caches were missing on arm64.
> > > >> > 
> > > >> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid
> > > >> > putting a DMA buffer on the same cache line as some other data
> > > >> > that might be _written_ by the CPU while the corresponding
> > > >> > main memory is modified by another bus-mastering device.
> > > >> > 
> > > >> > Consider this layout:
> > > >> > 
> > > >> >  ... | DMA buffer | other data | ...
> > > >> >      ^                         ^
> > > >> >      +-------------------------+-- cache line boundaries
> > > >> > 
> > > >> > When you prepare for DMA, you make sure that the DMA buffer is
> > > >> > not cached by the CPU, so you flush the cache line (from all
> > > >> > levels). Then you tell the device to write into the DMA
> > > >> > buffer. However, before the device finishes the DMA
> > > >> > transaction, the CPU accesses "other data", loading this cache
> > > >> > line from main memory with partial results. Worse, if the CPU
> > > >> > writes to "other data", it may write the cache line back into
> > > >> > main memory, racing with the device writing to DMA buffer, and
> > > >> > you end up with corrupted data in DMA buffer.  
> > 
> > Yes, cache evictions from 'other data; can override the DMA. Another
> > problem, when the DMA completed, the kernel does a cache invalidation
> > to remove any speculatively loaded cache lines from the DMA buffer
> > but that would also invalidate 'other data', potentially corrupting
> > it if it was dirty.
> > 
> > So it's not safe to have DMA into buffers less than ARCH_DMA_MINALIGN
> > (and unaligned).
> 
> It's not safe to DMA into buffers that share a CPU cache line with other
> data, which could be before or after the DMA buffer, of course.

Was the original problem reported for an arm64 platform? It wasn't clear
to me from the thread.

For arm64, the only problem is if the other data is being modified
_while_ the transfer is taking place. Otherwise, when mapping the buffer
for the device, the kernel cleans the caches and writes the other data to RAM.
See arch_sync_dma_for_device(). This is non-destructive w.r.t. the data
in both the DMA buffer and the red zone.

After the transfer (FROM_DEVICE), the arch_sync_dma_for_cpu()
invalidates the caches, including other data, but since they were
previously written to RAM in the for_device case, they'd be read into
the cache on access without any corruption.

Of course, this assumes that the device keeps within the limits and does
not write beyond the DMA buffer into the red zone. If it does, the
buffer overflow warning is valid.

While I think we are ok for arm64, other architectures may invalidate
the caches in the arch_sync_dma_for_device() which could discard the red
zone data. A quick grep for arch_sync_dma_for_device() shows several
architectures invalidating the caches in the FROM_DEVICE case.

> > What I did with reducing the minimum kmalloc()
> > alignment was to force bouncing via swiotlb if the size passed to the
> > DMA API is small. It may end up bouncing buffers that did not
> > originate from kmalloc() or have proper alignment (with padding) but
> > that's some heuristics we were willing to accept to be able to use
> > small kmalloc() caches on arm64 - see dma_kmalloc_needs_bounce().
> > 
> > Does redzoning apply to kmalloc() or kmem_cache_create() (or both)? I
> > haven't checked yet but if the red zone is within ARCH_DMA_MINALIGN
> > (or rather dma_get_cache_alignment()), we could have issues with
> > either corrupting the DMA buffer or the red zone. [...]
> 
> I'm sorry if I'm being thick, but IIUC the red zone does not have to be
> protected. Yes, we might miss red zone corruption if it happens to race
> with a DMA transaction, but I have assumed that this is permissible. I
> regard the red zone as a useful debugging tool, not a safety measure
> that is guaranteed to detect any write beyond the buffer end.

Yeah, it's debugging, but too many false positives make the feature
pretty useless.

> > > > I'm not familiar with DMA stuff, but Vlastimil's idea does make it
> > > > easier for driver developer to write a driver to be used on
> > > > different ARCHs, which have different DMA alignment requirement.
> > > > Say if the minimal safe size is 8 bytes, the driver can just
> > > > request 8 bytes and ARCH_DMA_MINALIGN will automatically chose
> > > > the right size for it, which can save memory for ARCHs with
> > > > smaller alignment requirement. Meanwhile it does sacrifice part
> > > > of the redzone check ability, so I don't have preference here :)  
> > > 
> > > Let's clarify first who's expected to ensure the word alignment for
> > > DMA, if it's not kmalloc() then I'd rather resist moving it there
> > > :)  
> > 
> > In theory, the DMA API should handle the alignment as I tried to
> > remove it from the kmalloc() code.
> 
> Are we talking about the alignment of the starting address, or buffer
> size, or both?

The DMA API bouncing logic only checks the buffer size and assumes
start/end would be aligned to kmalloc_size_roundup() with no valid data
between them (FROM_DEVICE).

> > With kmem_cache_create() (or kmalloc() as well), if the object size is
> > not cacheline-aligned, is there risk of redzoning around the object
> > without any alignment restrictions? The logic in
> > dma_kmalloc_size_aligned() would fail for sufficiently large buffers
> > but with unaligned red zone around the object.
> 
> This red zone is the extra memory that is normally wasted by kmalloc()
> rounding up the requested size to the bucket size.
> But dma_kmalloc_size_aligned() already uses kmalloc_size_roundup(size),
> so it seems to be covered.

Assuming I got kmalloc redzoning right, I think there's still a
potential issue. Let's say we have a system with 128-byte DMA alignment
required (the largest cache line size). We do a kmalloc(104) and
kmalloc_size_roundup() returns 128, so all seems good to the DMA code.
However, kmalloc() redzones from 104 to 128 as it tracks the original
size. The DMA bouncing doesn't spot it since the
kmalloc_size_roundup(104) is aligned to 128.
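
Layout-wise:

    |<----------------- 128-byte kmalloc-128 object ------------------>|
    |<------- 104 bytes requested -------->|<---- 24-byte redzone ---->|
                                            ^ a FROM_DEVICE transfer using
                                              the full 128-byte cache line
                                              can still land here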

The above is a problem even for architectures that don't select
DMA_BOUNCE_UNALIGNED_KMALLOC but have non-coherent DMA (well, selecting
it may have a better chance of working if the buffers are small).

So I think 946fa0dbf2d8 ("mm/slub: extend redzone check to extra
allocated kmalloc space than requested") is broken on most architectures
that select ARCH_HAS_SYNC_DMA_FOR_DEVICE (arm64 is ok as it does a
write-back in arch_sync_dma_for_device() irrespective of direction).

We can hide this extended kmalloc() redzoning behind an arch select but,
as it is, I'd only do redzoning from an ALIGN(orig_size,
dma_get_cache_alignment()) offset.

Is the combination of SLAB_HWCACHE_ALIGN and SLAB_RED_ZONE similarly
affected? At least that's an explicit opt in and people shouldn't pass
it if they intend the objects to be used for DMA.

-- 
Catalin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-08 15:07                   ` Catalin Marinas
@ 2025-04-09  8:39                     ` Petr Tesarik
  2025-04-09  9:05                       ` Petr Tesarik
  2025-04-09  8:51                     ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Petr Tesarik @ 2025-04-09  8:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Robin Murphy, Sean Christopherson, Halil Pasic

On Tue, 8 Apr 2025 16:07:19 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Tue, Apr 08, 2025 at 07:27:32AM +0200, Petr Tesarik wrote:
> > On Mon, 7 Apr 2025 18:12:09 +0100
> > Catalin Marinas <catalin.marinas@arm.com> wrote:  
> > > Thanks for looping me in. I'm just catching up with this thread.
> > > 
> > > On Mon, Apr 07, 2025 at 09:54:41AM +0200, Vlastimil Babka wrote:  
> > > > On 4/7/25 09:21, Feng Tang wrote:    
> > > > > On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> > > > > [...]    
> > > > >> > I can remember this series, as well as my confusion why
> > > > >> > 192-byte kmalloc caches were missing on arm64.
> > > > >> > 
> > > > >> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid
> > > > >> > putting a DMA buffer on the same cache line as some other data
> > > > >> > that might be _written_ by the CPU while the corresponding
> > > > >> > main memory is modified by another bus-mastering device.
> > > > >> > 
> > > > >> > Consider this layout:
> > > > >> > 
> > > > >> >  ... | DMA buffer | other data | ...
> > > > >> >      ^                         ^
> > > > >> >      +-------------------------+-- cache line boundaries
> > > > >> > 
> > > > >> > When you prepare for DMA, you make sure that the DMA buffer is
> > > > >> > not cached by the CPU, so you flush the cache line (from all
> > > > >> > levels). Then you tell the device to write into the DMA
> > > > >> > buffer. However, before the device finishes the DMA
> > > > >> > transaction, the CPU accesses "other data", loading this cache
> > > > >> > line from main memory with partial results. Worse, if the CPU
> > > > >> > writes to "other data", it may write the cache line back into
> > > > >> > main memory, racing with the device writing to DMA buffer, and
> > > > >> > you end up with corrupted data in DMA buffer.    
> > > 
> > > Yes, cache evictions from 'other data; can override the DMA. Another
> > > problem, when the DMA completed, the kernel does a cache invalidation
> > > to remove any speculatively loaded cache lines from the DMA buffer
> > > but that would also invalidate 'other data', potentially corrupting
> > > it if it was dirty.
> > > 
> > > So it's not safe to have DMA into buffers less than ARCH_DMA_MINALIGN
> > > (and unaligned).  
> > 
> > It's not safe to DMA into buffers that share a CPU cache line with other
> > data, which could be before or after the DMA buffer, of course.  
>[...]
> While I think we are ok for arm64, other architectures may invalidate
> the caches in the arch_sync_dma_for_device() which could discard the red
> zone data. A quick grep for arch_sync_dma_for_device() shows several
> architectures invalidating the caches in the FROM_DEVICE case.

Wait. This sounds broken for partial writes into the DMA buffer, i.e.
where only part of the buffer is updated by a bus-mastering device.
When I worked on swiotlb, I was told that such partial updates must
be supported, and that's why the initial swiotlb_bounce() cannot be
removed from swiotlb_tbl_map_single(). In fact, the comment says:

	/*
	 * When the device is writing memory, i.e. dir == DMA_FROM_DEVICE, copy
	 * the original buffer to the TLB buffer before initiating DMA in order
	 * to preserve the original's data if the device does a partial write,
	 * i.e. if the device doesn't overwrite the entire buffer.  Preserving
	 * the original data, even if it's garbage, is necessary to match
	 * hardware behavior.  Use of swiotlb is supposed to be transparent,
	 * i.e. swiotlb must not corrupt memory by clobbering unwritten bytes.
	 */

You may want to check commit ddbd89deb7d3 ("swiotlb: fix info leak with
DMA_FROM_DEVICE"), commit aa6f8dcbab47 ("swiotlb: rework "fix info leak
with DMA_FROM_DEVICE") and commit 1132a1dc053e ("swiotlb: rewrite
comment explaining why the source is preserved on DMA_FROM_DEVICE").

I believe there is potential for a nasty race condition, and maybe even
an info leak. Consider this:

1. DMA buffer is allocated by kmalloc(). The memory area previously
   contained sensitive information, which had been written to main
   memory.
2. The DMA buffer is initialized with zeroes, but this new content
   stays in a CPU cache (because this is kernel memory with a write
   behind cache policy).
3. DMA is set up, but nothing is written to main memory by the
   bus-mastering device.
4. The CPU cache line is now discarded in arch_sync_dma_for_cpu().

IIUC the zeroes were never written to main memory, and previous content
can now be read by the CPU through the DMA buffer.
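
In code, the sequence I have in mind (hypothetical, not tied to a specific
driver or architecture):

    buf = kmalloc(64, GFP_KERNEL);
    memset(buf, 0, 64);                  /* zeroes sit in a dirty cache line */
    dma = dma_map_single(dev, buf, 64, DMA_FROM_DEVICE);
                                         /* if the arch *invalidates* here,
                                            the zeroes never reach RAM */
    /* device writes nothing, or only part of the buffer */
    dma_unmap_single(dev, dma, 64, DMA_FROM_DEVICE);   /* invalidate again */
    /* reading buf now returns whatever was in RAM before - old, possibly
       sensitive data */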

I haven't checked if any architecture is affected, but I strongly
believe that the CPU cache MUST be flushed both before and after the
DMA transfer. Any architecture which does not do it that way should be
fixed.

Or did I miss a crucial detail (again)?

> > > What I did with reducing the minimum kmalloc()
> > > alignment was to force bouncing via swiotlb if the size passed to
> > > the DMA API is small. It may end up bouncing buffers that did not
> > > originate from kmalloc() or have proper alignment (with padding)
> > > but that's some heuristics we were willing to accept to be able
> > > to use small kmalloc() caches on arm64 - see
> > > dma_kmalloc_needs_bounce().
> > > 
> > > Does redzoning apply to kmalloc() or kmem_cache_create() (or
> > > both)? I haven't checked yet but if the red zone is within
> > > ARCH_DMA_MINALIGN (or rather dma_get_cache_alignment()), we could
> > > have issues with either corrupting the DMA buffer or the red
> > > zone. [...]  
> > 
> > I'm sorry if I'm being thick, but IIUC the red zone does not have
> > to be protected. Yes, we might miss red zone corruption if it
> > happens to race with a DMA transaction, but I have assumed that
> > this is permissible. I regard the red zone as a useful debugging
> > tool, not a safety measure that is guaranteed to detect any write
> > beyond the buffer end.  
> 
> Yeah, it's debugging, but too many false positives make the feature
> pretty useless.

Agreed. I was defending only sporadic false negatives. If there is room
for false positives, that's a no-go.

> > > > > I'm not familiar with DMA stuff, but Vlastimil's idea does make it
> > > > > easier for driver developer to write a driver to be used on
> > > > > different ARCHs, which have different DMA alignment requirement.
> > > > > Say if the minimal safe size is 8 bytes, the driver can just
> > > > > request 8 bytes and ARCH_DMA_MINALIGN will automatically chose
> > > > > the right size for it, which can save memory for ARCHs with
> > > > > smaller alignment requirement. Meanwhile it does sacrifice part
> > > > > of the redzone check ability, so I don't have preference here :)    
> > > > 
> > > > Let's clarify first who's expected to ensure the word alignment for
> > > > DMA, if it's not kmalloc() then I'd rather resist moving it there
> > > > :)    
> > > 
> > > In theory, the DMA API should handle the alignment as I tried to
> > > remove it from the kmalloc() code.  
> > 
> > Are we talking about the alignment of the starting address, or buffer
> > size, or both?  
> 
> The DMA API bouncing logic only checks the buffer size and assumes
> start/end would be aligned to kmalloc_size_roundup() with no valid data
> between them (FROM_DEVICE).

If that's correct behavior, then it must be documented, because other
places seem to disagree on the assumptions about prior content of a
DMA buffer (FROM_DEVICE). See the swiotlb comment above.

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-08 15:07                   ` Catalin Marinas
  2025-04-09  8:39                     ` Petr Tesarik
@ 2025-04-09  8:51                     ` Vlastimil Babka
  2025-04-09 11:11                       ` Catalin Marinas
  1 sibling, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2025-04-09  8:51 UTC (permalink / raw)
  To: Catalin Marinas, Petr Tesarik
  Cc: Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo, David Rientjes,
	Christoph Lameter, linux-mm@kvack.org

On 4/8/25 5:07 PM, Catalin Marinas wrote:

> Assuming I got kmalloc redzoning right, I think there's still a
> potential issue. Let's say we have a system with 128-byte DMA alignment
> required (the largest cache line size). We do a kmalloc(104) and
> kmalloc_size_roundup() returns 128, so all seems good to the DMA code.
> However, kmalloc() redzones from 104 to 128 as it tracks the original
> size. The DMA bouncing doesn't spot it since the
> kmalloc_size_roundup(104) is aligned to 128.

Note that kmalloc_size_roundup() is supposed to be used *before*
kmalloc(), such as dma_resv_list_alloc() does. Then there's no issue as
no redzoning would be done between 104 and 128; there would be only
the additional redzone at 128+.
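
For illustration, the intended pattern is roughly this (made-up driver
snippet, not from any real code, using the sizes from your example):

	size_t size = kmalloc_size_roundup(104);	/* 128 on such a system */
	u8 *buf = kmalloc(size, GFP_KERNEL);

	if (!buf)
		return -ENOMEM;
	/*
	 * Redzoning starts only at 'size', so a device writing anywhere in
	 * buf[0..size-1] no longer trips the extended redzone check.
	 */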



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09  8:39                     ` Petr Tesarik
@ 2025-04-09  9:05                       ` Petr Tesarik
  2025-04-09  9:47                         ` Catalin Marinas
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Tesarik @ 2025-04-09  9:05 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Robin Murphy, Sean Christopherson, Halil Pasic

On Wed, 9 Apr 2025 10:39:04 +0200
Petr Tesarik <ptesarik@suse.com> wrote:

> On Tue, 8 Apr 2025 16:07:19 +0100
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Tue, Apr 08, 2025 at 07:27:32AM +0200, Petr Tesarik wrote:  
> > > On Mon, 7 Apr 2025 18:12:09 +0100
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:    
> > > > Thanks for looping me in. I'm just catching up with this thread.
> > > > 
> > > > On Mon, Apr 07, 2025 at 09:54:41AM +0200, Vlastimil Babka wrote:    
> > > > > On 4/7/25 09:21, Feng Tang wrote:      
> > > > > > On Sun, Apr 06, 2025 at 10:02:40PM +0800, Feng Tang wrote:
> > > > > > [...]      
> > > > > >> > I can remember this series, as well as my confusion why
> > > > > >> > 192-byte kmalloc caches were missing on arm64.
> > > > > >> > 
> > > > > >> > Nevertheless, I believe ARCH_DMA_MINALIGN is required to avoid
> > > > > >> > putting a DMA buffer on the same cache line as some other data
> > > > > >> > that might be _written_ by the CPU while the corresponding
> > > > > >> > main memory is modified by another bus-mastering device.
> > > > > >> > 
> > > > > >> > Consider this layout:
> > > > > >> > 
> > > > > >> >  ... | DMA buffer | other data | ...
> > > > > >> >      ^                         ^
> > > > > >> >      +-------------------------+-- cache line boundaries
> > > > > >> > 
> > > > > >> > When you prepare for DMA, you make sure that the DMA buffer is
> > > > > >> > not cached by the CPU, so you flush the cache line (from all
> > > > > >> > levels). Then you tell the device to write into the DMA
> > > > > >> > buffer. However, before the device finishes the DMA
> > > > > >> > transaction, the CPU accesses "other data", loading this cache
> > > > > >> > line from main memory with partial results. Worse, if the CPU
> > > > > >> > writes to "other data", it may write the cache line back into
> > > > > >> > main memory, racing with the device writing to DMA buffer, and
> > > > > >> > you end up with corrupted data in DMA buffer.      
> > > > 
> > > > Yes, cache evictions from 'other data' can overwrite the DMA. Another
> > > > problem: when the DMA has completed, the kernel does a cache invalidation
> > > > to remove any speculatively loaded cache lines from the DMA buffer
> > > > but that would also invalidate 'other data', potentially corrupting
> > > > it if it was dirty.
> > > > 
> > > > So it's not safe to have DMA into buffers less than ARCH_DMA_MINALIGN
> > > > (and unaligned).    
> > > 
> > > It's not safe to DMA into buffers that share a CPU cache line with other
> > > data, which could be before or after the DMA buffer, of course.    
> >[...]
> > While I think we are ok for arm64, other architectures may invalidate
> > the caches in the arch_sync_dma_for_device() which could discard the red
> > zone data. A quick grep for arch_sync_dma_for_device() shows several
> > architectures invalidating the caches in the FROM_DEVICE case.  
> 
> Wait. This sounds broken for partial writes into the DMA buffer, i.e.
> where only part of the buffer is updated by a bus-mastering device.
> When I worked on swiotlb, I was told that such partial updates must
> be supported, and that's why the initial swiotlb_bounce() cannot be
> removed from swiotlb_tbl_map_single(). In fact, the comment says:
> 
> 	/*
> 	 * When the device is writing memory, i.e. dir == DMA_FROM_DEVICE, copy
> 	 * the original buffer to the TLB buffer before initiating DMA in order
> 	 * to preserve the original's data if the device does a partial write,
> 	 * i.e. if the device doesn't overwrite the entire buffer.  Preserving
> 	 * the original data, even if it's garbage, is necessary to match
> 	 * hardware behavior.  Use of swiotlb is supposed to be transparent,
> 	 * i.e. swiotlb must not corrupt memory by clobbering unwritten bytes.
> 	 */
> 
> You may want to check commit ddbd89deb7d3 ("swiotlb: fix info leak with
> DMA_FROM_DEVICE"), commit aa6f8dcbab47 ("swiotlb: rework "fix info leak
> with DMA_FROM_DEVICE") and commit 1132a1dc053e ("swiotlb: rewrite
> comment explaining why the source is preserved on DMA_FROM_DEVICE").
> 
> I believe there is potential for a nasty race condition, and maybe even
> info leak. Consider this:
> 
> 1. DMA buffer is allocated by kmalloc(). The memory area previously
>    contained sensitive information, which had been written to main
>    memory.
> 2. The DMA buffer is initialized with zeroes, but this new content
>    stays in a CPU cache (because this is kernel memory with a write
>    behind cache policy).
> 3. DMA is set up, but nothing is written to main memory by the
>    bus-mastering device.
> 4. The CPU cache line is now discarded in arch_sync_dma_for_cpu().
> 
> IIUC the zeroes were never written to main memory, and previous content
> can now be read by the CPU through the DMA buffer.
> 
> I haven't checked if any architecture is affected, but I strongly
> believe that the CPU cache MUST be flushed both before and after the
> DMA transfer. Any architecture which does not do it that way should be
> fixed.
> 
> Or did I miss a crucial detail (again)?

Just after sending this, I realized I did. :(

There is a step between 2 and 3:

2a. arch_sync_dma_for_device() invalidates the CPU cache line.
    Architectures which do not write previous content to main memory
    effectively undo the zeroing here.

AFAICS the consequence is still the same: race condition and/or info
leak on partial (or zero) DMA write.

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09  9:05                       ` Petr Tesarik
@ 2025-04-09  9:47                         ` Catalin Marinas
  2025-04-09 12:18                           ` Petr Tesarik
  0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2025-04-09  9:47 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Robin Murphy, Sean Christopherson, Halil Pasic

On Wed, Apr 09, 2025 at 11:05:29AM +0200, Petr Tesarik wrote:
> On Wed, 9 Apr 2025 10:39:04 +0200
> Petr Tesarik <ptesarik@suse.com> wrote:
> > I believe there is potential for a nasty race condition, and maybe even
> > info leak. Consider this:
> > 
> > 1. DMA buffer is allocated by kmalloc(). The memory area previously
> >    contained sensitive information, which had been written to main
> >    memory.
> > 2. The DMA buffer is initialized with zeroes, but this new content
> >    stays in a CPU cache (because this is kernel memory with a write
> >    behind cache policy).
> > 3. DMA is set up, but nothing is written to main memory by the
> >    bus-mastering device.
> > 4. The CPU cache line is now discarded in arch_sync_dma_for_cpu().
> > 
> > IIUC the zeroes were never written to main memory, and previous content
> > can now be read by the CPU through the DMA buffer.
> > 
> > I haven't checked if any architecture is affected, but I strongly
> > believe that the CPU cache MUST be flushed both before and after the
> > DMA transfer. Any architecture which does not do it that way should be
> > fixed.
> > 
> > Or did I miss a crucial detail (again)?
> 
> Just after sending this, I realized I did. :(
> 
> There is a step between 2 and 3:
> 
> 2a. arch_sync_dma_for_device() invalidates the CPU cache line.
>     Architectures which do not write previous content to main memory
>     effectively undo the zeroing here.

Good point, that's a problem on those architectures that invalidate the
caches in arch_sync_dma_for_device(). We fixed it for arm64 in 5.19 -
c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start
of DMA transfer") - for the same reasons, information leak.

So we could ignore all those architectures. If people complain
about redzone failures, we can ask them to fix it. Well, crude attempt
below at fixing those. I skipped powerpc and for arch/arm I only
addressed cache-v7. Completely untested.

But I wonder whether it's easier to fix the callers of
arch_sync_dma_for_device and always pass DMA_BIDIRECTIONAL for security
reasons:

----------------------------8<------------------------------
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index cb7e29dcac15..73ee3826a825 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1103,7 +1103,7 @@ void iommu_dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle,
 	swiotlb_sync_single_for_device(dev, phys, size, dir);

 	if (!dev_is_dma_coherent(dev))
-		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_for_device(phys, size, DMA_BIDIRECTIONAL);
 }

 void iommu_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl,
@@ -1134,7 +1134,8 @@ void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 							 sg->length, dir);
 	else if (!dev_is_dma_coherent(dev))
 		for_each_sg(sgl, sg, nelems, i)
-			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
+			arch_sync_dma_for_device(sg_phys(sg), sg->length,
+						 DMA_BIDIRECTIONAL);
 }

 dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
@@ -1189,7 +1190,7 @@ dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	}

 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_for_device(phys, size, DMA_BIDIRECTIONAL);

 	iova = __iommu_dma_map(dev, phys, size, prot, dma_mask);
 	if (iova == DMA_MAPPING_ERROR)
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 1f65795cf5d7..6e508d7f4010 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -247,7 +247,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
 done:
 	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
 		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dev_addr))))
-			arch_sync_dma_for_device(phys, size, dir);
+			arch_sync_dma_for_device(phys, size, DMA_BIDIRECTIONAL);
 		else
 			xen_dma_sync_for_device(dev, dev_addr, size, dir);
 	}
@@ -316,7 +316,7 @@ xen_swiotlb_sync_single_for_device(struct device *dev, dma_addr_t dma_addr,

 	if (!dev_is_dma_coherent(dev)) {
 		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr))))
-			arch_sync_dma_for_device(paddr, size, dir);
+			arch_sync_dma_for_device(paddr, size, DMA_BIDIRECTIONAL);
 		else
 			xen_dma_sync_for_device(dev, dma_addr, size, dir);
 	}
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index b8fe0b3d0ffb..f4e8d23fd086 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -408,7 +408,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,

 		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_device(paddr, sg->length,
-					dir);
+						 DMA_BIDIRECTIONAL);
 	}
 }
 #endif
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index d2c0b7e632fc..d5f575e1a623 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -61,7 +61,7 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
 	swiotlb_sync_single_for_device(dev, paddr, size, dir);

 	if (!dev_is_dma_coherent(dev))
-		arch_sync_dma_for_device(paddr, size, dir);
+		arch_sync_dma_for_device(paddr, size, DMA_BIDIRECTIONAL);
 }

 static inline void dma_direct_sync_single_for_cpu(struct device *dev,
@@ -107,7 +107,7 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
 	}

 	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_for_device(phys, size, DMA_BIDIRECTIONAL);
 	return dma_addr;
 }

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index abcf3fa63a56..1e21bd65b08c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1598,7 +1598,7 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
 	}

 	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		arch_sync_dma_for_device(swiotlb_addr, size, dir);
+		arch_sync_dma_for_device(swiotlb_addr, size, DMA_BIDIRECTIONAL);
 	return dma_addr;
 }

----------------------------8<------------------------------

And that's the partial change for most arches but I'd rather go with the
above:

----------------------------8<------------------------------
diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 6b85e94f3275..2902b3378b21 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -51,22 +51,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 		enum dma_data_direction dir)
 {
-	switch (dir) {
-	case DMA_TO_DEVICE:
-		dma_cache_wback(paddr, size);
-		break;
-
-	case DMA_FROM_DEVICE:
-		dma_cache_inv(paddr, size);
-		break;
-
-	case DMA_BIDIRECTIONAL:
-		dma_cache_wback_inv(paddr, size);
-		break;
-
-	default:
-		break;
-	}
+	dma_cache_wback(paddr, size);
 }

 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 201ca05436fa..3787c4b839dd 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -441,8 +441,6 @@ SYM_FUNC_END(v7_dma_flush_range)
  */
 SYM_TYPED_FUNC_START(v7_dma_map_area)
 	add	r1, r1, r0
-	teq	r2, #DMA_FROM_DEVICE
-	beq	v7_dma_inv_range
 	b	v7_dma_clean_range
 SYM_FUNC_END(v7_dma_map_area)

diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index fecac107fd0d..b2432726b082 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -17,11 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 		enum dma_data_direction dir)
 {
 	dmac_map_area(__va(paddr), size, dir);
-
-	if (dir == DMA_FROM_DEVICE)
-		outer_inv_range(paddr, paddr + size);
-	else
-		outer_clean_range(paddr, paddr + size);
+	outer_clean_range(paddr, paddr + size);
 }

 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 88c2d68a69c9..ceae4c027f53 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -682,13 +682,8 @@ static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
 	phys_addr_t paddr;
 
 	dma_cache_maint_page(page, off, size, dir, dmac_map_area);
-
 	paddr = page_to_phys(page) + off;
-	if (dir == DMA_FROM_DEVICE) {
-		outer_inv_range(paddr, paddr + size);
-	} else {
-		outer_clean_range(paddr, paddr + size);
-	}
+	outer_clean_range(paddr, paddr + size);
 	/* FIXME: non-speculating: flush on bidirectional mappings? */
 }

diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
index 82447029feb4..3862a56cb3ac 100644
--- a/arch/csky/mm/dma-mapping.c
+++ b/arch/csky/mm/dma-mapping.c
@@ -58,17 +58,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 		enum dma_data_direction dir)
 {
-	switch (dir) {
-	case DMA_TO_DEVICE:
-		cache_op(paddr, size, dma_wb_range);
-		break;
-	case DMA_FROM_DEVICE:
-	case DMA_BIDIRECTIONAL:
-		cache_op(paddr, size, dma_wbinv_range);
-		break;
-	default:
-		BUG();
-	}
+	cache_op(paddr, size, dma_wb_range);
 }

 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
index 882680e81a30..8ca011acc4fc 100644
--- a/arch/hexagon/kernel/dma.c
+++ b/arch/hexagon/kernel/dma.c
@@ -14,22 +14,8 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 {
 	void *addr = phys_to_virt(paddr);

-	switch (dir) {
-	case DMA_TO_DEVICE:
-		hexagon_clean_dcache_range((unsigned long) addr,
-		(unsigned long) addr + size);
-		break;
-	case DMA_FROM_DEVICE:
-		hexagon_inv_dcache_range((unsigned long) addr,
-		(unsigned long) addr + size);
-		break;
-	case DMA_BIDIRECTIONAL:
-		flush_dcache_range((unsigned long) addr,
-		(unsigned long) addr + size);
-		break;
-	default:
-		BUG();
-	}
+	hexagon_clean_dcache_range((unsigned long) addr,
+				   (unsigned long) addr + size);
 }

 /*
diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c
index 16063783aa80..95902d306412 100644
--- a/arch/m68k/kernel/dma.c
+++ b/arch/m68k/kernel/dma.c
@@ -29,17 +29,5 @@ pgprot_t pgprot_dmacoherent(pgprot_t prot)
 void arch_sync_dma_for_device(phys_addr_t handle, size_t size,
 		enum dma_data_direction dir)
 {
-	switch (dir) {
-	case DMA_BIDIRECTIONAL:
-	case DMA_TO_DEVICE:
-		cache_push(handle, size);
-		break;
-	case DMA_FROM_DEVICE:
-		cache_clear(handle, size);
-		break;
-	default:
-		pr_err_ratelimited("dma_sync_single_for_device: unsupported dir %u\n",
-				   dir);
-		break;
-	}
+	cache_push(handle, size);
 }
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index 04d091ade417..68e6c946d273 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -14,14 +14,19 @@
 #include <linux/bug.h>
 #include <asm/cacheflush.h>

-static void __dma_sync(phys_addr_t paddr, size_t size,
-		enum dma_data_direction direction)
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+		enum dma_data_direction dir)
+{
+	flush_dcache_range(paddr, paddr + size);
+}
+
+void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+		enum dma_data_direction dir)
 {
-	switch (direction) {
+	switch (dir) {
 	case DMA_TO_DEVICE:
-	case DMA_BIDIRECTIONAL:
-		flush_dcache_range(paddr, paddr + size);
 		break;
+	case DMA_BIDIRECTIONAL:
 	case DMA_FROM_DEVICE:
 		invalidate_dcache_range(paddr, paddr + size);
 		break;
@@ -29,15 +34,3 @@ static void __dma_sync(phys_addr_t paddr, size_t size,
 		BUG();
 	}
 }
-
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
-		enum dma_data_direction dir)
-{
-	__dma_sync(paddr, size, dir);
-}
-
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
-		enum dma_data_direction dir)
-{
-	__dma_sync(paddr, size, dir);
-}
diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c
index fd887d5f3f9a..35730ad8787d 100644
--- a/arch/nios2/mm/dma-mapping.c
+++ b/arch/nios2/mm/dma-mapping.c
@@ -23,23 +23,12 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 {
 	void *vaddr = phys_to_virt(paddr);

-	switch (dir) {
-	case DMA_FROM_DEVICE:
-		invalidate_dcache_range((unsigned long)vaddr,
-			(unsigned long)(vaddr + size));
-		break;
-	case DMA_TO_DEVICE:
-		/*
-		 * We just need to flush the caches here , but Nios2 flush
-		 * instruction will do both writeback and invalidate.
-		 */
-	case DMA_BIDIRECTIONAL: /* flush and invalidate */
-		flush_dcache_range((unsigned long)vaddr,
-			(unsigned long)(vaddr + size));
-		break;
-	default:
-		BUG();
-	}
+	/*
+	 * We just need to flush the caches here , but Nios2 flush
+	 * instruction will do both writeback and invalidate.
+	 */
+	flush_dcache_range((unsigned long)vaddr,
+			   (unsigned long)(vaddr + size));
 }

 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index b3edbb33b621..747218e17237 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -101,25 +101,8 @@ void arch_sync_dma_for_device(phys_addr_t addr, size_t size,
 	unsigned long cl;
 	struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];

-	switch (dir) {
-	case DMA_TO_DEVICE:
-		/* Flush the dcache for the requested range */
-		for (cl = addr; cl < addr + size;
-		     cl += cpuinfo->dcache_block_size)
-			mtspr(SPR_DCBFR, cl);
-		break;
-	case DMA_FROM_DEVICE:
-		/* Invalidate the dcache for the requested range */
-		for (cl = addr; cl < addr + size;
-		     cl += cpuinfo->dcache_block_size)
-			mtspr(SPR_DCBIR, cl);
-		break;
-	default:
-		/*
-		 * NOTE: If dir == DMA_BIDIRECTIONAL then there's no need to
-		 * flush nor invalidate the cache here as the area will need
-		 * to be manually synced anyway.
-		 */
-		break;
-	}
+	/* Flush the dcache for the requested range */
+	for (cl = addr; cl < addr + size;
+	     cl += cpuinfo->dcache_block_size)
+		mtspr(SPR_DCBFR, cl);
 }
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index cb89d7e0ba88..2e6734c2a20b 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -69,30 +69,7 @@ static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 			      enum dma_data_direction dir)
 {
-	switch (dir) {
-	case DMA_TO_DEVICE:
-		arch_dma_cache_wback(paddr, size);
-		break;
-
-	case DMA_FROM_DEVICE:
-		if (!arch_sync_dma_clean_before_fromdevice()) {
-			arch_dma_cache_inv(paddr, size);
-			break;
-		}
-		fallthrough;
-
-	case DMA_BIDIRECTIONAL:
-		/* Skip the invalidate here if it's done later */
-		if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
-		    arch_sync_dma_cpu_needs_post_dma_flush())
-			arch_dma_cache_wback(paddr, size);
-		else
-			arch_dma_cache_wback_inv(paddr, size);
-		break;
-
-	default:
-		break;
-	}
+	arch_dma_cache_wback(paddr, size);
 }

 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
diff --git a/arch/sh/kernel/dma-coherent.c b/arch/sh/kernel/dma-coherent.c
index 6a44c0e7ba40..1e0491f9b026 100644
--- a/arch/sh/kernel/dma-coherent.c
+++ b/arch/sh/kernel/dma-coherent.c
@@ -17,17 +17,5 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 {
 	void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));

-	switch (dir) {
-	case DMA_FROM_DEVICE:		/* invalidate only */
-		__flush_invalidate_region(addr, size);
-		break;
-	case DMA_TO_DEVICE:		/* writeback only */
-		__flush_wback_region(addr, size);
-		break;
-	case DMA_BIDIRECTIONAL:		/* writeback and invalidate */
-		__flush_purge_region(addr, size);
-		break;
-	default:
-		BUG();
-	}
+	__flush_wback_region(addr, size);
 }
diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index 94955caa4488..3da1ee2b5d84 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -64,20 +64,8 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 		enum dma_data_direction dir)
 {
-	switch (dir) {
-	case DMA_BIDIRECTIONAL:
-	case DMA_TO_DEVICE:
-		if (XCHAL_DCACHE_IS_WRITEBACK)
-			do_cache_op(paddr, size, __flush_dcache_range);
-		break;
-
-	case DMA_NONE:
-		BUG();
-		break;
-
-	default:
-		break;
-	}
+	if (XCHAL_DCACHE_IS_WRITEBACK)
+		do_cache_op(paddr, size, __flush_dcache_range);
 }

 void arch_dma_prep_coherent(struct page *page, size_t size)


-- 
Catalin


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09  8:51                     ` Vlastimil Babka
@ 2025-04-09 11:11                       ` Catalin Marinas
  2025-04-09 12:22                         ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2025-04-09 11:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Petr Tesarik, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

On Wed, Apr 09, 2025 at 10:51:43AM +0200, Vlastimil Babka wrote:
> On 4/8/25 5:07 PM, Catalin Marinas wrote:
> > Assuming I got kmalloc redzoning right, I think there's still a
> > potential issue. Let's say we have a system with 128-byte DMA alignment
> > required (the largest cache line size). We do a kmalloc(104) and
> > kmalloc_size_roundup() returns 128, so all seems good to the DMA code.
> > However, kmalloc() redzones from 104 to 128 as it tracks the original
> > size. The DMA bouncing doesn't spot it since the
> > kmalloc_size_roundup(104) is aligned to 128.
> 
> Note that kmalloc_size_roundup() is supposed to be used *before*
> kmalloc(), such as dma_resv_list_alloc() does. Then there's no issue as
> no redzoning would be done between 104 and 128; there would be only
> the additional redzone at 128+.

Yes, if people use it this way. devm_kmalloc() via alloc_dr() also seems
to be handling this. However, given the original report, I assume there
are drivers that have a problem with redzoning at the end of the buffer.

I did a quick test with kmem_cache_create() of 104 bytes with
SLAB_HWCACHE_ALIGN (64 bytes) and it has a similar problem with the
redzone from byte 104 onwards. Here we don't have the equivalent of
kmalloc_size_roundup() that a driver can use.

-- 
Catalin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09  9:47                         ` Catalin Marinas
@ 2025-04-09 12:18                           ` Petr Tesarik
  2025-04-09 12:49                             ` Catalin Marinas
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Tesarik @ 2025-04-09 12:18 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Robin Murphy, Sean Christopherson, Halil Pasic

On Wed, 9 Apr 2025 10:47:36 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Wed, Apr 09, 2025 at 11:05:29AM +0200, Petr Tesarik wrote:
> > On Wed, 9 Apr 2025 10:39:04 +0200
> > Petr Tesarik <ptesarik@suse.com> wrote:  
> > > I believe there is potential for a nasty race condition, and maybe even
> > > info leak. Consider this:
> > > 
> > > 1. DMA buffer is allocated by kmalloc(). The memory area previously
> > >    contained sensitive information, which had been written to main
> > >    memory.
> > > 2. The DMA buffer is initialized with zeroes, but this new content
> > >    stays in a CPU cache (because this is kernel memory with a write
> > >    behind cache policy).
> > > 3. DMA is set up, but nothing is written to main memory by the
> > >    bus-mastering device.
> > > 4. The CPU cache line is now discarded in arch_sync_dma_for_cpu().
> > > 
> > > IIUC the zeroes were never written to main memory, and previous content
> > > can now be read by the CPU through the DMA buffer.
> > > 
> > > I haven't checked if any architecture is affected, but I strongly
> > > believe that the CPU cache MUST be flushed both before and after the
> > > DMA transfer. Any architecture which does not do it that way should be
> > > fixed.
> > > 
> > > Or did I miss a crucial detail (again)?  
> > 
> > Just after sending this, I realized I did. :(
> > 
> > There is a step between 2 and 3:
> > 
> > 2a. arch_sync_dma_for_device() invalidates the CPU cache line.
> >     Architectures which do not write previous content to main memory
> >     effectively undo the zeroing here.  
> 
> Good point, that's a problem on those architectures that invalidate the
> caches in arch_sync_dma_for_device(). We fixed it for arm64 in 5.19 -
> c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start
> of DMA transfer") - for the same reasons, information leak.
> 
> So we could ignore all those architectures. If people complain
> about redzone failures, we can ask them to fix it. Well, crude attempt
> below at fixing those. I skipped powerpc and for arch/arm I only
> addressed cache-v7. Completely untested.
> 
> But I wonder whether it's easier to fix the callers of
> arch_sync_dma_for_device and always pass DMA_BIDIRECTIONAL for security
> reasons:

Then we could remove this parameter as useless. ;-)

Now, arch_sync_for_device() must always do the same thing, regardless
of the transfer direction: Write back any cached DMA buffer data to
main memory. Except, if an architecture can do this with or without
invalidating the cache line _and_ it never loads cache content
speculatively (does not even have a prefetch instruction), it might
optimize like this:

- DMA_FROM_DEVICE:
  - arch_sync_for_device(): writeback and invalidate
  - arch_sync_for_cpu(): nothing

- anything else:
  - arch_sync_for_device(): writeback only
  - arch_sync_for_cpu(): invalidate unless DMA_TO_DEVICE

I don't know if there's any such non-prefetching architecture, much
less if this optimization would make sense there. It's just the only
scenario I can imagine where the direction parameter makes any
difference for arch_sync_for_device().

Unless you are aware of such a combination, I suggest killing the
parameter and implementing the "anything else" case above for all
architectures. 
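
Concretely, the "anything else" case would boil down to something like
this per architecture (just a sketch of mine, borrowing the helper names
from the riscv part of your patch, completely untested):

	void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
				      enum dma_data_direction dir)
	{
		/* always write back before the device runs, never discard */
		arch_dma_cache_wback(paddr, size);
	}

	void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
				   enum dma_data_direction dir)
	{
		/* drop possibly stale lines only after the device has written */
		if (dir != DMA_TO_DEVICE)
			arch_dma_cache_inv(paddr, size);
	}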

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09 11:11                       ` Catalin Marinas
@ 2025-04-09 12:22                         ` Vlastimil Babka
  2025-04-09 14:30                           ` Catalin Marinas
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2025-04-09 12:22 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Petr Tesarik, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

On 4/9/25 1:11 PM, Catalin Marinas wrote:
> On Wed, Apr 09, 2025 at 10:51:43AM +0200, Vlastimil Babka wrote:
>> On 4/8/25 5:07 PM, Catalin Marinas wrote:
>>> Assuming I got kmalloc redzoning right, I think there's still a
>>> potential issue. Let's say we have a system with 128-byte DMA alignment
>>> required (the largest cache line size). We do a kmalloc(104) and
>>> kmalloc_size_roundup() returns 128, so all seems good to the DMA code.
>>> However, kmalloc() redzones from 104 to 128 as it tracks the original
>>> size. The DMA bouncing doesn't spot it since the
>>> kmalloc_size_roundup(104) is aligned to 128.
>>
>> Note that kmalloc_size_roundup() is supposed to be used *before*
>> kmalloc(), such as dma_resv_list_alloc() does. Then there's no issue as
>> no redzoning would be done between 104 and 128; there would be only
>> the additional redzone at 128+.
> 
> Yes, if people use it this way. devm_kmalloc() via alloc_dr() also seems
> to be handling this. However, given the original report, I assume there

We can probably ignore my original private discussion as motivation as
it wasn't confirmed (and I'm not sure it will) that it was really a case
involving DMA alignment. It was just something I thought might be
possible explanation and wanted to doublecheck with people more
knowledgeable.

Unless you mean original report as 120ee599b5bf ("staging: octeon-usb:
prevent memory corruption") that Feng mentioned.

> are drivers that have a problem with redzoning at the end of the buffer.

So I'm not aware of any issues reported due to the extended redzoning.

> I did a quick test with kmem_cache_create() of 104 bytes with
> SLAB_HWCACHE_ALIGN (64 bytes) and it has a similar problem with the
> redzone from byte 104 onwards. Here we don't have the equivalent of
> kmalloc_size_roundup() that a driver can use.

AFAIK SLAB_HWCACHE_ALIGN exists for performance reasons, not to provide
dma guarantees as kmalloc(). So I'd say users of kmem_cache_create()
would have to do their own rounding - you mentioned
dma_get_cache_alignment()? And there's an align parameter too when
creating caches.
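
E.g. something like this (rough sketch only, untested, cache name and
object size made up):

	unsigned int dma_align = dma_get_cache_alignment();
	struct kmem_cache *cache;

	cache = kmem_cache_create("foo_dma_cache",
				  ALIGN(104, dma_align),	/* pad the object size */
				  dma_align,			/* and align its start */
				  SLAB_HWCACHE_ALIGN, NULL);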




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09 12:18                           ` Petr Tesarik
@ 2025-04-09 12:49                             ` Catalin Marinas
  2025-04-09 13:41                               ` Petr Tesarik
  0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2025-04-09 12:49 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Robin Murphy, Sean Christopherson, Halil Pasic

On Wed, Apr 09, 2025 at 02:18:16PM +0200, Petr Tesarik wrote:
> On Wed, 9 Apr 2025 10:47:36 +0100
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Apr 09, 2025 at 11:05:29AM +0200, Petr Tesarik wrote:
> > > On Wed, 9 Apr 2025 10:39:04 +0200
> > > Petr Tesarik <ptesarik@suse.com> wrote:  
> > > > I believe there is potential for a nasty race condition, and maybe even
> > > > info leak. Consider this:
> > > > 
> > > > 1. DMA buffer is allocated by kmalloc(). The memory area previously
> > > >    contained sensitive information, which had been written to main
> > > >    memory.
> > > > 2. The DMA buffer is initialized with zeroes, but this new content
> > > >    stays in a CPU cache (because this is kernel memory with a write
> > > >    behind cache policy).
> > > > 3. DMA is set up, but nothing is written to main memory by the
> > > >    bus-mastering device.
> > > > 4. The CPU cache line is now discarded in arch_sync_dma_for_cpu().
> > > > 
> > > > IIUC the zeroes were never written to main memory, and previous content
> > > > can now be read by the CPU through the DMA buffer.
> > > > 
> > > > I haven't checked if any architecture is affected, but I strongly
> > > > believe that the CPU cache MUST be flushed both before and after the
> > > > DMA transfer. Any architecture which does not do it that way should be
> > > > fixed.
> > > > 
> > > > Or did I miss a crucial detail (again)?  
> > > 
> > > Just after sending this, I realized I did. :(
> > > 
> > > There is a step between 2 and 3:
> > > 
> > > 2a. arch_sync_dma_for_device() invalidates the CPU cache line.
> > >     Architectures which do not write previous content to main memory
> > >     effectively undo the zeroing here.  
> > 
> > Good point, that's a problem on those architectures that invalidate the
> > caches in arch_sync_dma_for_device(). We fixed it for arm64 in 5.19 -
> > c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start
> > of DMA transfer") - for the same reasons, information leak.
> > 
> > So we could ignore all those architectures. If people complain
> > about redzone failures, we can ask them to fix it. Well, crude attempt
> > below at fixing those. I skipped powerpc and for arch/arm I only
> > addressed cache-v7. Completely untested.
> > 
> > But I wonder whether it's easier to fix the callers of
> > arch_sync_dma_for_device and always pass DMA_BIDIRECTIONAL for security
> > reasons:
> 
> Then we could remove this parameter as useless. ;-)

Yes, I just took the lazy approach of not changing the arch code.

> Now, arch_sync_for_device() must always do the same thing, regardless
> of the transfer direction: Write back any cached DMA buffer data to
> main memory. Except, if an architecture can do this with or without
> invalidating the cache line _and_ it never loads cache content
> speculatively (does not even have a prefetch instruction), it might
> optimize like this:
> 
> - DMA_FROM_DEVICE:
>   - arch_sync_for_device(): writeback and invalidate
>   - arch_sync_for_cpu(): nothing
> 
> - anything else:
>   - arch_sync_for_device(): writeback only
>   - arch_sync_for_cpu(): invalidate unless DMA_TO_DEVICE
> 
> I don't know if there's any such non-prefetching architecture, much
> less if this optimization would make sense there. It's just the only
> scenario I can imagine where the direction parameter makes any
> difference for arch_sync_for_device().

We have at least some old arm cores (pre ARMv6) that have a no-op for
the arch_sync_dma_for_cpu() case. They rely on the for_device code to
invalidate the caches. That's why I suggested DMA_BIDIRECTIONAL as a
cover-all option. However, such default may affect the performance of
CPUs that would benefit from a writeback without invalidate.

Basically what we need is something like:

	if (dir == DMA_FROM_DEVICE)
		dir = DMA_BIDIRECTIONAL;

We could hide this in the core DMA code (behind some macro/function) or
update the arch code to do this. I don't have the setup to test all of
them (I can test arm64, maybe modern arm32 but that's about it).

Or we ignore the problem and tell people to fix their DMA ops if they
see complains about slab redzones. We also ignore the potential
information leak we've had for decades.

-- 
Catalin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09 12:49                             ` Catalin Marinas
@ 2025-04-09 13:41                               ` Petr Tesarik
  0 siblings, 0 replies; 24+ messages in thread
From: Petr Tesarik @ 2025-04-09 13:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vlastimil Babka, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org,
	Robin Murphy, Sean Christopherson, Halil Pasic

On Wed, 9 Apr 2025 13:49:48 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Wed, Apr 09, 2025 at 02:18:16PM +0200, Petr Tesarik wrote:
> > On Wed, 9 Apr 2025 10:47:36 +0100
> > Catalin Marinas <catalin.marinas@arm.com> wrote:  
> > > On Wed, Apr 09, 2025 at 11:05:29AM +0200, Petr Tesarik wrote:  
> > > > On Wed, 9 Apr 2025 10:39:04 +0200
> > > > Petr Tesarik <ptesarik@suse.com> wrote:    
> > > > > I believe there is potential for a nasty race condition, and maybe even
> > > > > info leak. Consider this:
> > > > > 
> > > > > 1. DMA buffer is allocated by kmalloc(). The memory area previously
> > > > >    contained sensitive information, which had been written to main
> > > > >    memory.
> > > > > 2. The DMA buffer is initialized with zeroes, but this new content
> > > > >    stays in a CPU cache (because this is kernel memory with a write
> > > > >    behind cache policy).
> > > > > 3. DMA is set up, but nothing is written to main memory by the
> > > > >    bus-mastering device.
> > > > > 4. The CPU cache line is now discarded in arch_sync_dma_for_cpu().
> > > > > 
> > > > > IIUC the zeroes were never written to main memory, and previous content
> > > > > can now be read by the CPU through the DMA buffer.
> > > > > 
> > > > > I haven't checked if any architecture is affected, but I strongly
> > > > > believe that the CPU cache MUST be flushed both before and after the
> > > > > DMA transfer. Any architecture which does not do it that way should be
> > > > > fixed.
> > > > > 
> > > > > Or did I miss a crucial detail (again)?    
> > > > 
> > > > Just after sending this, I realized I did. :(
> > > > 
> > > > There is a step between 2 and 3:
> > > > 
> > > > 2a. arch_sync_dma_for_device() invalidates the CPU cache line.
> > > >     Architectures which do not write previous content to main memory
> > > >     effectively undo the zeroing here.    
> > > 
> > > Good point, that's a problem on those architectures that invalidate the
> > > caches in arch_sync_dma_for_device(). We fixed it for arm64 in 5.19 -
> > > c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start
> > > of DMA transfer") - for the same reasons, information leak.
> > > 
> > > So we could ignore all those architectures. If people complain
> > > about redzone failures, we can ask them to fix it. Well, crude attempt
> > > below at fixing those. I skipped powerpc and for arch/arm I only
> > > addressed cache-v7. Completely untested.
> > > 
> > > But I wonder whether it's easier to fix the callers of
> > > arch_sync_dma_for_device and always pass DMA_BIDIRECTIONAL for security
> > > reasons:  
> > 
> > Then we could remove this parameter as useless. ;-)  
> 
> Yes, I just took the lazy approach of not changing the arch code.
> 
> > Now, arch_sync_for_device() must always do the same thing, regardless
> > of the transfer direction: Write back any cached DMA buffer data to
> > main memory. Except, if an architecture can do this with or without
> > invalidating the cache line _and_ it never loads cache content
> > speculatively (does not even have a prefetch instruction), it might
> > optimize like this:
> > 
> > - DMA_FROM_DEVICE:
> >   - arch_sync_for_device(): writeback and invalidate
> >   - arch_sync_for_cpu(): nothing
> > 
> > - anything else:
> >   - arch_sync_for_device(): writeback only
> >   - arch_sync_for_cpu(): invalidate unless DMA_TO_DEVICE
> > 
> > I don't know if there's any such non-prefetching architecture, much
> > less if this optimization would make sense there. It's just the only
> > scenario I can imagine where the direction parameter makes any
> > difference for arch_sync_for_device().  
> 
> We have at least some old arm cores (pre ARMv6) that have a no-op for
> the arch_sync_dma_for_cpu() case. They rely on the for_device code to
> invalidate the caches. That's why I suggested DMA_BIDIRECTIONAL as a
> cover-all option. However, such default may affect the performance of
> CPUs that would benefit from a writeback without invalidate.

Er, like ARM9 chips? Well, if even _you_ cannot test on those, they're
probably not worth the effort.

But yes, there might be another niche architecture, maybe not even
supported by Linux as of today, which would benefit from the
optimization. Let's keep the parameter.

> Basically what we need is something like:
> 
> 	if (dir == DMA_FROM_DEVICE)
> 		dir = DMA_BIDIRECTIONAL;
> 
> We could hide this in the core DMA code (behind some macro/function) or
> update the arch code to do this. I don't have the setup to test all of
> them (I can test arm64, maybe modern arm32 but that's about it).

This sounds like a good plan. I've just grepped and there are exactly
nine callers of arch_sync_dma_for_device() in arch-independent code,
plus four under arch/mips.

Let's make the above an inline function. As for naming, I'm delighted
to say that sync_dma_for_device() is not yet used for anything. :-)
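
Something as simple as this should do (my sketch only, untested; the
direction upgrade matches your snippet above):

	/* somewhere in the core DMA code, e.g. kernel/dma/direct.h */
	static inline void sync_dma_for_device(phys_addr_t paddr, size_t size,
					       enum dma_data_direction dir)
	{
		/* never allow an invalidate-only sync before the transfer */
		if (dir == DMA_FROM_DEVICE)
			dir = DMA_BIDIRECTIONAL;
		arch_sync_dma_for_device(paddr, size, dir);
	}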

Petr T


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09 12:22                         ` Vlastimil Babka
@ 2025-04-09 14:30                           ` Catalin Marinas
  2025-04-10  1:54                             ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2025-04-09 14:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Petr Tesarik, Feng Tang, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org

On Wed, Apr 09, 2025 at 02:22:10PM +0200, Vlastimil Babka wrote:
> On 4/9/25 1:11 PM, Catalin Marinas wrote:
> > On Wed, Apr 09, 2025 at 10:51:43AM +0200, Vlastimil Babka wrote:
> >> On 4/8/25 5:07 PM, Catalin Marinas wrote:
> >>> Assuming I got kmalloc redzoning right, I think there's still a
> >>> potential issue. Let's say we have a system with 128-byte DMA alignment
> >>> required (the largest cache line size). We do a kmalloc(104) and
> >>> kmalloc_size_roundup() returns 128, so all seems good to the DMA code.
> >>> However, kmalloc() redzones from 104 to 128 as it tracks the original
> >>> size. The DMA bouncing doesn't spot it since the
> >>> kmalloc_size_roundup(104) is aligned to 128.
> >>
> >> Note that kmalloc_size_roundup() is supposed to be used *before*
> >> kmalloc(), such as dma_resv_list_alloc() does. Then there's no issue as
> >> no redzoning would be done between 104 and 128; there would be only
> >> the additional redzone at 128+.
> > 
> > Yes, if people use it this way. devm_kmalloc() via alloc_dr() also seems
> > to be handling this. However, given the original report, I assume there
> 
> We can probably ignore my original private discussion as motivation as
> it wasn't confirmed (and I'm not sure it will) that it was really a case
> involving DMA alignment. It was just something I thought might be
> possible explanation and wanted to doublecheck with people more
> knowledgeable.
> 
> Unless you mean original report as 120ee599b5bf ("staging: octeon-usb:
> prevent memory corruption") that Feng mentioned.

I was referring to your private discussion. IIUC the one Feng mentioned
was about the SLOB allocator which I recall did not guarantee natural
alignment for power-of-two allocations.

> > are drivers that have a problem with redzoning at the end of the buffer.
> 
> So I'm not aware of any issues reported due to the extended redzoning.

Thanks for confirming. I guess we can ignore this potential issue then
as long as drivers take care of the alignment or use devm_kmalloc().

> > I did a quick test with kmem_cache_create() of 104 bytes with
> > SLAB_HWCACHE_ALIGN (64 bytes) and it has a similar problem with the
> > redzone from byte 104 onwards. Here we don't have the equivalent of
> > kmalloc_size_roundup() that a driver can use.
> 
> AFAIK SLAB_HWCACHE_ALIGN exists for performance reasons, not to provide
> dma guarantees as kmalloc(). So I'd say users of kmem_cache_create()
> would have to do their own rounding - you mentioned
> dma_get_cache_alignment()? And there's an align parameter too when
> creating caches.

I just checked and the align parameter only aligns the start of the
buffer; the redzone start is not aligned.

Anyway, as in the other subthread with Petr, I think most architectures
would benefit from an update to the DMA cache maintenance to avoid
corrupting the redzone.

-- 
Catalin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slub - extended kmalloc redzone and dma alignment
  2025-04-09 14:30                           ` Catalin Marinas
@ 2025-04-10  1:54                             ` Feng Tang
  0 siblings, 0 replies; 24+ messages in thread
From: Feng Tang @ 2025-04-10  1:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vlastimil Babka, Petr Tesarik, Harry Yoo, Peng Fan, Hyeonggon Yoo,
	David Rientjes, Christoph Lameter, linux-mm@kvack.org, Kees Cook

On Wed, Apr 09, 2025 at 03:30:16PM +0100, Catalin Marinas wrote:
> On Wed, Apr 09, 2025 at 02:22:10PM +0200, Vlastimil Babka wrote:
> > On 4/9/25 1:11 PM, Catalin Marinas wrote:
> > > On Wed, Apr 09, 2025 at 10:51:43AM +0200, Vlastimil Babka wrote:
> > >> On 4/8/25 5:07 PM, Catalin Marinas wrote:
> > >>> Assuming I got kmalloc redzoning right, I think there's still a
> > >>> potential issue. Let's say we have a system with 128-byte DMA alignment
> > >>> required (the largest cache line size). We do a kmalloc(104) and
> > >>> kmalloc_size_roundup() returns 128, so all seems good to the DMA code.
> > >>> However, kmalloc() redzones from 104 to 128 as it tracks the original
> > >>> size. The DMA bouncing doesn't spot it since the
> > >>> kmalloc_size_roundup(104) is aligned to 128.
> > >>
> > >> Note that kmalloc_size_roundup() is supposed to be used *before*
> > >> kmalloc(), such as dma_resv_list_alloc() does. Then there's no issue as
> > >> no redzoning would be done between 104 and 128; there would be only
> > >> the additional redzone at 128+.
> > > 
> > > Yes, if people use it this way. devm_kmalloc() via alloc_dr() also seems
> > > to be handling this. However, given the original report, I assume there
> > 
> > We can probably ignore my original private discussion as motivation as
> > it wasn't confirmed (and I'm not sure it will) that it was really a case
> > involving DMA alignment. It was just something I thought might be
> > possible explanation and wanted to doublecheck with people more
> > knowledgeable.
> > 
> > Unless you mean original report as 120ee599b5bf ("staging: octeon-usb:
> > prevent memory corruption") that Feng mentioned.
> 
> I was referring to your private discussion. IIUC the one Feng mentioned
> was about the SLOB allocator which I recall did not guarantee natural
> alignment for power-of-two allocations.
> 
> > > are drivers that have a problem with redzoning at the end of the buffer.
> > 
> > So I'm not aware of any issues reported due to the extended redzoning.

Me either.

> Thanks for confirming. I guess we can ignore this potential issue then
> as long as drivers take care of the alignment or use devm_kmalloc().

Yes, I agree it's better to let the driver take care of the alignment part.
IMHO, touching memory beyond the originally requested size is a kind of
abuse, no matter whether it comes from intentional software behavior or
'unexpected' hardware behavior. The kmalloc_size_roundup() patchset was
initially introduced to help reduce similar potential issues:
https://lore.kernel.org/lkml/20220922031013.2150682-1-keescook@chromium.org/t/#u

Thanks,
Feng

> 
> > > I did a quick test with kmem_cache_create() of 104 bytes with
> > > SLAB_HWCACHE_ALIGN (64 bytes) and it has a similar problem with the
> > > redzone from byte 104 onwards. Here we don't have the equivalent of
> > > kmalloc_size_roundup() that a driver can use.
> > 
> > AFAIK SLAB_HWCACHE_ALIGN exists for performance reasons, not to provide
> > dma guarantees as kmalloc(). So I'd say users of kmem_cache_create()
> > would have to do their own rounding - you mentioned
> > dma_get_cache_alignment()? And there's an align parameter too when
> > creating caches.
> 
> I just checked and the align parameter only aligns the start of the
> buffer; the redzone start is not aligned.
> 
> Anyway, as in the other subthread with Petr, I think most architectures
> would benefit from an update to the DMA cache maintenance to avoid
> corrupting the redzone.
> 
> -- 
> Catalin


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-04-10  1:54 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2025-04-04  9:30 slub - extended kmalloc redzone and dma alignment Vlastimil Babka
2025-04-04 10:30 ` Harry Yoo
2025-04-04 11:12   ` Petr Tesarik
2025-04-04 12:45     ` Vlastimil Babka
2025-04-04 13:53       ` Petr Tesarik
2025-04-06 14:02         ` Feng Tang
2025-04-07  7:21           ` Feng Tang
2025-04-07  7:54             ` Vlastimil Babka
2025-04-07  9:50               ` Petr Tesarik
2025-04-07 17:12               ` Catalin Marinas
2025-04-08  5:27                 ` Petr Tesarik
2025-04-08 15:07                   ` Catalin Marinas
2025-04-09  8:39                     ` Petr Tesarik
2025-04-09  9:05                       ` Petr Tesarik
2025-04-09  9:47                         ` Catalin Marinas
2025-04-09 12:18                           ` Petr Tesarik
2025-04-09 12:49                             ` Catalin Marinas
2025-04-09 13:41                               ` Petr Tesarik
2025-04-09  8:51                     ` Vlastimil Babka
2025-04-09 11:11                       ` Catalin Marinas
2025-04-09 12:22                         ` Vlastimil Babka
2025-04-09 14:30                           ` Catalin Marinas
2025-04-10  1:54                             ` Feng Tang
2025-04-07  7:45         ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).