* [PATCH] iommu/arm-smmu-v3: Shrink command/event/PRI queues in kdump kernel
From: Kiryl Shutsemau (Meta) @ 2026-07-01 15:45 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel
Cc: Jason Gunthorpe, Nicolin Chen, Kyle McMartin, Breno Leitao,
Usama Arif, linux-arm-kernel, iommu, linux-kernel,
Kiryl Shutsemau (Meta)
The command, event and PRI queues are sized from the maxima the hardware
advertises in IDR1, which can be several megabytes each. On systems with
many SMMUv3 instances that cost is paid per instance and adds up to tens
of megabytes of coherent DMA in the capture kernel.
A kdump capture kernel runs from a small crashkernel reservation and only
has to drive the few devices used to save the dump, so deep queues serve
no purpose. The queues carry invalidation commands and fault records, not
DMA data, so dump throughput is unaffected; a shallower queue only bounds
how many commands may be in flight before a sync, which does not matter for
the capture kernel's small device count and modest I/O.
Clamp every queue to a single page when is_kdump_kernel() is true. Doing
it in arm_smmu_init_one_queue() covers the command, event and PRI queues
in one place. The command queue still holds at least one batch plus a sync
(256 entries on a 4K-page kernel, well above CMDQ_BATCH_ENTRIES), so
command batching keeps working.
Suggested-by: Kyle McMartin <jkkm@meta.com>
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e8d7dbe495f0..6ec3ef5ee0da 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4414,6 +4414,20 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
{
size_t qsz;
+ /*
+ * A kdump capture kernel runs from a small crashkernel reservation and
+ * only has to drive the few devices used to save the dump, so there is
+ * no point sizing the queues for the (multi-megabyte) maxima the
+ * hardware advertises. Clamp each queue to a single page. ent_sz_shift
+ * is the log2 of the entry size in bytes (dwords * 8).
+ */
+ if (is_kdump_kernel()) {
+ u32 ent_sz_shift = ilog2(dwords) + 3;
+
+ q->llq.max_n_shift = min_t(u32, q->llq.max_n_shift,
+ PAGE_SHIFT - ent_sz_shift);
+ }
+
do {
qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
--
2.54.0
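For readers who want to check the arithmetic behind the clamp, here is a minimal user-space sketch. It is not driver code. Every sketch_* name is hypothetical, SKETCH_PAGE_SHIFT assumes 4K pages, and the example shift of 19 and the 4-dword entry size in the usage note are illustrative assumptions rather than values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4K pages. The kernel's PAGE_SHIFT depends on the configuration. */
#define SKETCH_PAGE_SHIFT 12u

/* Integer log2 of a non-zero power of two (stand-in for the kernel's ilog2()). */
static uint32_t sketch_ilog2(uint32_t v)
{
	uint32_t n = 0;

	while ((v >>= 1) != 0)
		n++;
	return n;
}

/* Clamp a queue's log2 entry count so that the whole queue fits in one page. */
static uint32_t sketch_clamp_shift(uint32_t max_n_shift, uint32_t dwords)
{
	uint32_t ent_sz_shift;
	uint32_t page_limit;

	/* The shift arithmetic only works for power-of-two entry sizes. */
	assert(dwords != 0 && (dwords & (dwords - 1)) == 0);

	ent_sz_shift = sketch_ilog2(dwords) + 3;	/* log2(dwords * 8 bytes) */
	assert(ent_sz_shift <= SKETCH_PAGE_SHIFT);

	page_limit = SKETCH_PAGE_SHIFT - ent_sz_shift;
	return max_n_shift < page_limit ? max_n_shift : page_limit;
}

/* Queue size in bytes, mirroring the qsz expression in the diff context. */
static size_t sketch_queue_bytes(uint32_t n_shift, uint32_t dwords)
{
	return (((size_t)1 << n_shift) * dwords) << 3;
}
```

With 2-dword (16-byte) entries, sketch_clamp_shift(19, 2) gives 8, which is the 256 entries quoted in the commit message, and sketch_queue_bytes(8, 2) is exactly 4096. A hypothetical 4-dword entry would be clamped to shift 7 (128 entries), again one page. A queue whose advertised shift is already below the page limit is left unchanged.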
* Re: [PATCH] iommu/arm-smmu-v3: Shrink command/event/PRI queues in kdump kernel
From: Jason Gunthorpe @ 2026-07-02 0:16 UTC
To: Kiryl Shutsemau (Meta)
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen,
Kyle McMartin, Breno Leitao, Usama Arif, linux-arm-kernel, iommu,
linux-kernel
On Wed, Jul 01, 2026 at 04:45:28PM +0100, Kiryl Shutsemau (Meta) wrote:
Makes sense to me.
> + if (is_kdump_kernel()) {
> + u32 ent_sz_shift = ilog2(dwords) + 3;
> +
> + q->llq.max_n_shift = min_t(u32, q->llq.max_n_shift,
> + PAGE_SHIFT - ent_sz_shift);
I have lately seen many people saying that min_t should be avoided; why is
it needed here?
Jason
* Re: [PATCH] iommu/arm-smmu-v3: Shrink command/event/PRI queues in kdump kernel
From: Breno Leitao @ 2026-07-02 8:24 UTC
To: Jason Gunthorpe
Cc: Kiryl Shutsemau (Meta), Will Deacon, Robin Murphy, Joerg Roedel,
Nicolin Chen, Kyle McMartin, Usama Arif, linux-arm-kernel, iommu,
linux-kernel
On Wed, Jul 01, 2026 at 09:16:03PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 01, 2026 at 04:45:28PM +0100, Kiryl Shutsemau (Meta) wrote:
>
> Makes sense to me.
>
> > + if (is_kdump_kernel()) {
> > + u32 ent_sz_shift = ilog2(dwords) + 3;
> > +
> > + q->llq.max_n_shift = min_t(u32, q->llq.max_n_shift,
> > + PAGE_SHIFT - ent_sz_shift);
>
> I have lately seen many people saying that min_t should be avoided; why is
> it needed here?
Good point. Both arguments are already u32:
- q->llq.max_n_shift is u32 (struct arm_smmu_ll_queue)
- ent_sz_shift is u32 and PAGE_SHIFT is a small int constant, so
PAGE_SHIFT - ent_sz_shift is converted to u32 as well.
Plain min() should be enough, I would say. With that, please feel free to add:
Reviewed-by: Breno Leitao <leitao@debian.org>
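The type argument above can be checked outside the kernel with a small C11 sketch. This is only an illustration under stated assumptions: u32 is typedef'd to unsigned int as on Linux, SKETCH_PAGE_SHIFT stands in for PAGE_SHIFT with an assumed value of 12, and the sketch_* helpers and IS_UNSIGNED_INT are hypothetical names, not kernel API.

```c
#include <assert.h>

typedef unsigned int u32;	/* assumption: same underlying type as the kernel's u32 */

#define SKETCH_PAGE_SHIFT 12	/* plain int constant, assuming 4K pages */

/* Evaluates to 1 when the expression has type unsigned int, otherwise 0. */
#define IS_UNSIGNED_INT(expr) _Generic((expr), unsigned int: 1, default: 0)

/* Plain minimum of two u32 values; no cast is involved. */
static u32 sketch_min_u32(u32 a, u32 b)
{
	return a < b ? a : b;
}

/* Largest log2 entry count that still fits in one page. */
static u32 sketch_page_limit(u32 ent_sz_shift)
{
	/*
	 * int - u32: the usual arithmetic conversions convert the int
	 * operand to unsigned int, so the difference is unsigned int.
	 */
	static_assert(IS_UNSIGNED_INT(SKETCH_PAGE_SHIFT - (u32)0),
		      "int minus u32 should have type unsigned int");

	return SKETCH_PAGE_SHIFT - ent_sz_shift;
}
```

Since both operands end up with the same unsigned type, sketch_min_u32(q_shift, sketch_page_limit(4)) needs no cast, which is the property a plain min() would rely on here. The caller still has to guarantee ent_sz_shift <= SKETCH_PAGE_SHIFT, because the unsigned subtraction would otherwise wrap around.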