* [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one
@ 2026-05-19 6:47 tze.yee.ng
2026-05-19 7:18 ` sashiko-bot
2026-05-20 21:29 ` Frank Li
0 siblings, 2 replies; 4+ messages in thread
From: tze.yee.ng @ 2026-05-19 6:47 UTC (permalink / raw)
To: Olivier Dautricourt, Stefan Roese, Vinod Koul, Frank Li,
dmaengine, linux-kernel
Cc: Tze Yee Ng, Adrian Ng Ho Yin, Nazim Amirul
From: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
The descriptor FIFO requires that all words of a descriptor are written
in order, with the control word written last to flush it into the DMA
engine. Using memcpy() is unsafe since it does not guarantee ordering of
MMIO writes across all architectures.
Replace memcpy() with an explicit iowrite32() loop for each 32-bit word
(except the control word). The control word is still written separately,
with write barriers, to ensure it is always the final word pushed into
the FIFO.
This makes the programming of descriptors fully deterministic and
portable across different architectures.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
---
drivers/dma/altera-msgdma.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/dma/altera-msgdma.c b/drivers/dma/altera-msgdma.c
index b46999c81df0..5816973d2c70 100644
--- a/drivers/dma/altera-msgdma.c
+++ b/drivers/dma/altera-msgdma.c
@@ -495,6 +495,9 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
struct msgdma_sw_desc *desc)
{
void __iomem *hw_desc = mdev->desc;
+ const u32 *src = (const u32 *)&desc->hw_desc;
+ unsigned int i, nwords = offsetof(struct msgdma_extended_desc, control) /
+ sizeof(u32);
/*
* Check if the DESC FIFO it not full. If its full, we need to wait
@@ -505,16 +508,16 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
mdelay(1);
/*
- * The descriptor needs to get copied into the descriptor FIFO
- * of the DMA controller. The descriptor will get flushed to the
- * FIFO, once the last word (control word) is written. Since we
- * are not 100% sure that memcpy() writes all word in the "correct"
- * order (address from low to high) on all architectures, we make
- * sure this control word is written last by single coding it and
- * adding some write-barriers here.
+ * The descriptor must be written into the descriptor FIFO of the DMA
+ * controller. The FIFO is flushed and the descriptor becomes valid once
+ * the last word (the control word) is written. To guarantee the ordering
+ * of MMIO writes across all architectures, we write each 32-bit word
+ * individually using iowrite32(), and handle the control word separately
+ * at the end. This ensures the control word is always written last and
+ * prevents memcpy() or the compiler from reordering accesses.
*/
- memcpy((void __force *)hw_desc, &desc->hw_desc,
- sizeof(desc->hw_desc) - sizeof(u32));
+ for (i = 0; i < nwords; i++)
+ iowrite32(src[i], hw_desc + i * sizeof(u32));
/* Write control word last to flush this descriptor into the FIFO */
mdev->idle = false;
--
2.43.7
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one
2026-05-19 6:47 [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one tze.yee.ng
@ 2026-05-19 7:18 ` sashiko-bot
2026-05-20 2:58 ` NG, TZE YEE
2026-05-20 21:29 ` Frank Li
1 sibling, 1 reply; 4+ messages in thread
From: sashiko-bot @ 2026-05-19 7:18 UTC (permalink / raw)
To: tze.yee.ng; +Cc: Frank.Li, dmaengine, vkoul
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Unbounded busy-waiting with mdelay(1) inside a hard IRQ handler causes severe latency spikes and hard lockups. (Pre-existing issue)
- [High] Clearing the interrupt status after starting a new transfer creates a race condition leading to lost interrupts and a hung DMA pipeline. (Pre-existing issue)
- [Medium] The patch introduces excessive heavy memory barriers on the DMA fast path by using iowrite32() in a loop.
--
commit 64f4d2469744b57d2ca64e78f8cef229de2c5eae
Author: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one
This commit replaces memcpy() with an explicit iowrite32() loop when
writing descriptors into the DMA controller's FIFO. This ensures that
MMIO writes are not reordered by the compiler or architecture, guaranteeing
that the control word is always written last.
> diff --git a/drivers/dma/altera-msgdma.c b/drivers/dma/altera-msgdma.c
> index b46999c81df00..5816973d2c70a 100644
> --- a/drivers/dma/altera-msgdma.c
> +++ b/drivers/dma/altera-msgdma.c
> @@ -495,6 +495,9 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
> struct msgdma_sw_desc *desc)
> {
> void __iomem *hw_desc = mdev->desc;
> + const u32 *src = (const u32 *)&desc->hw_desc;
> + unsigned int i, nwords = offsetof(struct msgdma_extended_desc, control) /
> + sizeof(u32);
>
> /*
> * Check if the DESC FIFO it not full. If its full, we need to wait
> @@ -505,16 +508,16 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
> mdelay(1);
This isn't a regression from this patch, but can this unbounded busy-wait
inside a hard IRQ handler cause severe latency spikes and hard lockups?
If a large scatter-gather list is submitted that exceeds the hardware FIFO
size, msgdma_start_transfer() will attempt to push all of them at once.
The CPU will spin in mdelay(1) for each subsequent descriptor while the DMA
engine processes the transfers.
Waiting for hundreds of milliseconds with IRQs disabled could trigger a kernel
hard lockup watchdog panic. Furthermore, if the hardware stalls due to a bus
error, the CPU will spin forever.
Would it be possible to implement a proper timeout mechanism and software
queuing here instead?
[ ... ]
> - * The descriptor needs to get copied into the descriptor FIFO
> - * of the DMA controller. The descriptor will get flushed to the
> - * FIFO, once the last word (control word) is written. Since we
> - * are not 100% sure that memcpy() writes all word in the "correct"
> - * order (address from low to high) on all architectures, we make
> - * sure this control word is written last by single coding it and
> - * adding some write-barriers here.
> + * The descriptor must be written into the descriptor FIFO of the DMA
> + * controller. The FIFO is flushed and the descriptor becomes valid once
> + * the last word (the control word) is written. To guarantee the ordering
> + * of MMIO writes across all architectures, we write each 32-bit word
> + * individually using iowrite32(), and handle the control word separately
> + * at the end. This ensures the control word is always written last and
> + * prevents memcpy() or the compiler from reordering accesses.
> */
> - memcpy((void __force *)hw_desc, &desc->hw_desc,
> - sizeof(desc->hw_desc) - sizeof(u32));
> + for (i = 0; i < nwords; i++)
> + iowrite32(src[i], hw_desc + i * sizeof(u32));
Does this introduce excessive heavy memory barriers on the DMA fast path?
The iowrite32() macro maps to writel(), which includes an implicit heavy
memory barrier before the write to ensure memory accesses are visible to the
device.
Executing this barrier in a tight loop for every word of the DMA descriptor
could significantly degrade performance on the submission hot path.
Since the hardware merely requires the descriptor words to be written
sequentially before writing the final control word, could writel_relaxed()
be used instead to avoid the per-word barrier overhead? An explicit wmb()
is already present before the final control word write to ensure correct
ordering.
This isn't a regression from this patch, but could clearing the interrupt
status after starting a new transfer create a race condition leading to
lost interrupts and a hung DMA pipeline?
drivers/dma/altera-msgdma.c:msgdma_irq_handler() {
if ((status & MSGDMA_CSR_STAT_BUSY) == 0) {
/* Start next transfer if the DMA controller is idle */
spin_lock(&mdev->lock);
mdev->idle = true;
msgdma_start_transfer(mdev);
spin_unlock(&mdev->lock);
}
tasklet_schedule(&mdev->irq_tasklet);
/* Clear interrupt in mSGDMA controller */
iowrite32(MSGDMA_CSR_STAT_IRQ, mdev->csr + MSGDMA_CSR_STATUS);
}
If the newly started DMA transfer completes extremely quickly, before the
CPU reaches the iowrite32() instruction, the hardware will assert the IRQ
bit again.
The unconditional clear will then clear this new interrupt event, and because
the event is lost, the tasklet will not be scheduled for the second transfer.
On systems without a response FIFO, the tasklet completes exactly one
descriptor per run, so losing the interrupt could strand the second descriptor
in the active_list permanently.
Should the interrupt status be cleared before initiating the next transfer?
--
Sashiko AI review · https://sashiko.dev/#/patchset/4586c39b43aa3b9480989940fe905dac40c8cefc.1779173156.git.tze.yee.ng@altera.com?part=1
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one
2026-05-19 7:18 ` sashiko-bot
@ 2026-05-20 2:58 ` NG, TZE YEE
0 siblings, 0 replies; 4+ messages in thread
From: NG, TZE YEE @ 2026-05-20 2:58 UTC (permalink / raw)
To: sashiko-reviews@lists.linux.dev
Cc: Frank.Li@kernel.org, dmaengine@vger.kernel.org, vkoul@kernel.org
On 19/5/2026 3:18 pm, sashiko-bot@kernel.org wrote:
> [You don't often get email from sashiko-bot@kernel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [High] Unbounded busy-waiting with mdelay(1) inside a hard IRQ handler causes severe latency spikes and hard lockups. (Pre-existing issue)
> - [High] Clearing the interrupt status after starting a new transfer creates a race condition leading to lost interrupts and a hung DMA pipeline. (Pre-existing issue)
> - [Medium] The patch introduces excessive heavy memory barriers on the DMA fast path by using iowrite32() in a loop.
> --
>
> commit 64f4d2469744b57d2ca64e78f8cef229de2c5eae
> Author: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
>
> dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one
>
> This commit replaces memcpy() with an explicit iowrite32() loop when
> writing descriptors into the DMA controller's FIFO. This ensures that
> MMIO writes are not reordered by the compiler or architecture, guaranteeing
> that the control word is always written last.
>
>> diff --git a/drivers/dma/altera-msgdma.c b/drivers/dma/altera-msgdma.c
>> index b46999c81df00..5816973d2c70a 100644
>> --- a/drivers/dma/altera-msgdma.c
>> +++ b/drivers/dma/altera-msgdma.c
>> @@ -495,6 +495,9 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
>> struct msgdma_sw_desc *desc)
>> {
>> void __iomem *hw_desc = mdev->desc;
>> + const u32 *src = (const u32 *)&desc->hw_desc;
>> + unsigned int i, nwords = offsetof(struct msgdma_extended_desc, control) /
>> + sizeof(u32);
>>
>> /*
>> * Check if the DESC FIFO it not full. If its full, we need to wait
>> @@ -505,16 +508,16 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
>> mdelay(1);
>
> This isn't a regression from this patch, but can this unbounded busy-wait
> inside a hard IRQ handler cause severe latency spikes and hard lockups?
>
> If a large scatter-gather list is submitted that exceeds the hardware FIFO
> size, msgdma_start_transfer() will attempt to push all of them at once.
> The CPU will spin in mdelay(1) for each subsequent descriptor while the DMA
> engine processes the transfers.
>
> Waiting for hundreds of milliseconds with IRQs disabled could trigger a kernel
> hard lockup watchdog panic. Furthermore, if the hardware stalls due to a bus
> error, the CPU will spin forever.
>
> Would it be possible to implement a proper timeout mechanism and software
> queuing here instead?
>
> [ ... ]
>
>> - * The descriptor needs to get copied into the descriptor FIFO
>> - * of the DMA controller. The descriptor will get flushed to the
>> - * FIFO, once the last word (control word) is written. Since we
>> - * are not 100% sure that memcpy() writes all word in the "correct"
>> - * order (address from low to high) on all architectures, we make
>> - * sure this control word is written last by single coding it and
>> - * adding some write-barriers here.
>> + * The descriptor must be written into the descriptor FIFO of the DMA
>> + * controller. The FIFO is flushed and the descriptor becomes valid once
>> + * the last word (the control word) is written. To guarantee the ordering
>> + * of MMIO writes across all architectures, we write each 32-bit word
>> + * individually using iowrite32(), and handle the control word separately
>> + * at the end. This ensures the control word is always written last and
>> + * prevents memcpy() or the compiler from reordering accesses.
>> */
>> - memcpy((void __force *)hw_desc, &desc->hw_desc,
>> - sizeof(desc->hw_desc) - sizeof(u32));
>> + for (i = 0; i < nwords; i++)
>> + iowrite32(src[i], hw_desc + i * sizeof(u32));
>
> Does this introduce excessive heavy memory barriers on the DMA fast path?
>
> The iowrite32() macro maps to writel(), which includes an implicit heavy
> memory barrier before the write to ensure memory accesses are visible to the
> device.
>
> Executing this barrier in a tight loop for every word of the DMA descriptor
> could significantly degrade performance on the submission hot path.
>
> Since the hardware merely requires the descriptor words to be written
> sequentially before writing the final control word, could writel_relaxed()
> be used instead to avoid the per-word barrier overhead? An explicit wmb()
> is already present before the final control word write to ensure correct
> ordering.
>
> This isn't a regression from this patch, but could clearing the interrupt
> status after starting a new transfer create a race condition leading to
> lost interrupts and a hung DMA pipeline?
>
> drivers/dma/altera-msgdma.c:msgdma_irq_handler() {
> if ((status & MSGDMA_CSR_STAT_BUSY) == 0) {
> /* Start next transfer if the DMA controller is idle */
> spin_lock(&mdev->lock);
> mdev->idle = true;
> msgdma_start_transfer(mdev);
> spin_unlock(&mdev->lock);
> }
>
> tasklet_schedule(&mdev->irq_tasklet);
>
> /* Clear interrupt in mSGDMA controller */
> iowrite32(MSGDMA_CSR_STAT_IRQ, mdev->csr + MSGDMA_CSR_STATUS);
> }
>
> If the newly started DMA transfer completes extremely quickly, before the
> CPU reaches the iowrite32() instruction, the hardware will assert the IRQ
> bit again.
>
> The unconditional clear will then clear this new interrupt event, and because
> the event is lost, the tasklet will not be scheduled for the second transfer.
>
> On systems without a response FIFO, the tasklet completes exactly one
> descriptor per run, so losing the interrupt could strand the second descriptor
> in the active_list permanently.
>
> Should the interrupt status be cleared before initiating the next transfer?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/4586c39b43aa3b9480989940fe905dac40c8cefc.1779173156.git.tze.yee.ng@altera.com?part=1
Hi maintainers,
Thanks for the review.
This patch only replaces memcpy() with ordered per-word MMIO writes for
the descriptor FIFO; it does not change the mdelay() FIFO back-pressure
loop or msgdma_irq_handler() (ack ordering / msgdma_start_transfer() in
hardirq). Those issues predate this series—I can send separate fixes if
you want them, but I'd prefer not to mix them into this correctness patch.
On iowrite32() vs writel_relaxed(): the whole driver uses
ioread32/iowrite32 today (no _relaxed anywhere). I used iowrite32() to
match that and to be conservative on ordering. If you prefer
writel_relaxed() for the body words plus wmb() before the control word,
I'm happy to do that here or in a quick follow-up once you confirm
that's safe for this IP.
Please let me know which approach you prefer.
Thanks,
Tze Yee
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one
2026-05-19 6:47 [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one tze.yee.ng
2026-05-19 7:18 ` sashiko-bot
@ 2026-05-20 21:29 ` Frank Li
1 sibling, 0 replies; 4+ messages in thread
From: Frank Li @ 2026-05-20 21:29 UTC (permalink / raw)
To: tze.yee.ng
Cc: Olivier Dautricourt, Stefan Roese, Vinod Koul, Frank Li,
dmaengine, linux-kernel, Adrian Ng Ho Yin, Nazim Amirul
On Mon, May 18, 2026 at 11:47:20PM -0700, tze.yee.ng@altera.com wrote:
> From: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
>
> The descriptor FIFO requires that all words of a descriptor are written
> in order, with the control word written last to flush it into the DMA
> engine. Using memcpy() is unsafe since it does not guarantee ordering of
> MMIO writes across all architectures.
>
> Replace memcpy() with an explicit iowrite32() loop for each 32-bit word
> (except the control word). The control word is still written separately,
> with write barriers, to ensure it is always the final word pushed into
> the FIFO.
>
> This makes the programming of descriptors fully deterministic and
> portable across different architectures.
>
> Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
> Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
> ---
> drivers/dma/altera-msgdma.c | 21 ++++++++++++---------
> 1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/dma/altera-msgdma.c b/drivers/dma/altera-msgdma.c
> index b46999c81df0..5816973d2c70 100644
> --- a/drivers/dma/altera-msgdma.c
> +++ b/drivers/dma/altera-msgdma.c
> @@ -495,6 +495,9 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
> struct msgdma_sw_desc *desc)
> {
> void __iomem *hw_desc = mdev->desc;
> + const u32 *src = (const u32 *)&desc->hw_desc;
> + unsigned int i, nwords = offsetof(struct msgdma_extended_desc, control) /
> + sizeof(u32);
>
> /*
> * Check if the DESC FIFO it not full. If its full, we need to wait
> @@ -505,16 +508,16 @@ static void msgdma_copy_one(struct msgdma_device *mdev,
> mdelay(1);
>
> /*
> - * The descriptor needs to get copied into the descriptor FIFO
> - * of the DMA controller. The descriptor will get flushed to the
> - * FIFO, once the last word (control word) is written. Since we
> - * are not 100% sure that memcpy() writes all word in the "correct"
> - * order (address from low to high) on all architectures, we make
> - * sure this control word is written last by single coding it and
> - * adding some write-barriers here.
> + * The descriptor must be written into the descriptor FIFO of the DMA
> + * controller. The FIFO is flushed and the descriptor becomes valid once
> + * the last word (the control word) is written. To guarantee the ordering
> + * of MMIO writes across all architectures, we write each 32-bit word
> + * individually using iowrite32(), and handle the control word separately
> + * at the end. This ensures the control word is always written last and
> + * prevents memcpy() or the compiler from reordering accesses.
> */
> - memcpy((void __force *)hw_desc, &desc->hw_desc,
> - sizeof(desc->hw_desc) - sizeof(u32));
> + for (i = 0; i < nwords; i++)
> + iowrite32(src[i], hw_desc + i * sizeof(u32));
why not use memcpy_toio()?
Frank
>
> /* Write control word last to flush this descriptor into the FIFO */
> mdev->idle = false;
> --
> 2.43.7
>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-05-20 21:30 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-19 6:47 [PATCH] dma: altera-msgdma: Replace memcpy with io32write in msgdma_copy_one tze.yee.ng
2026-05-19 7:18 ` sashiko-bot
2026-05-20 2:58 ` NG, TZE YEE
2026-05-20 21:29 ` Frank Li
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox