* Re: [RFC PATCH V2] x86/VMBus: Confidential VMBus for dynamic DMA buffer transition
From: Aneesh Kumar K.V @ 2026-02-16 10:21 UTC
To: Robin Murphy, Michael Kelley, Tianyu Lan, kys@microsoft.com,
haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
longli@microsoft.com
Cc: Tianyu Lan, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, hch@infradead.org,
vdso@hexbites.dev, Suzuki K Poulose
In-Reply-To: <cc4dc4a6-2d74-49c1-bbb0-cfa44802a66b@arm.com>
Robin Murphy <robin.murphy@arm.com> writes:
> On 2026-02-11 6:00 pm, Michael Kelley wrote:
>> From: Tianyu Lan <ltykernel@gmail.com> Sent: Tuesday, February 10, 2026 8:21 AM
>>>
>>> Hyper-V provides Confidential VMBus to communicate between
>>> device model and device guest driver via encrypted/private
>>> memory in Confidential VM. The device model is in OpenHCL
>>> (https://openvmm.dev/guide/user_guide/openhcl.html) that
>>> plays the paravisor rule.
>>>
>>> For a VMBUS device, there are two communication methods to
>>
>> s/VMBUS/VMBus/
>>
>>> talk with Host/Hypervisor. 1) VMBus Ring buffer 2) dynamic
>>> DMA transition.
>>
>> I'm not sure what "dynamic DMA transition" is. Maybe just
>> "DMA transfers"? Also, do the same substitution further
>> down in this commit message.
>>
>>> The Confidential VMBus Ring buffer has been
>>> upstreamed by Roman Kisel(commit 6802d8af).
>>
>> It's customary to use 12 character commit IDs, which would be
>> 6802d8af47d1 in this case.
>>
>>>
>>> The dynamic DMA transition of VMBus device normally goes
>>> through DMA core and it uses SWIOTLB as bounce buffer in
>>> CVM
>>
>> "CVM" is Microsoft-speak. The Linux terminology is "a CoCo VM".
>>
>>> to communicate with Host/Hypervisor. The Confidential
>>> VMBus device may use private/encrypted memory to do DMA
>>> and so the device swiotlb(bounce buffer) isn't necessary.
>>
>> The phrase "isn't necessary" does not capture the real issue
>> here. Saying "isn't necessary" makes it sound like this patch
>> just avoids unnecessary work, so that it is a performance
>> improvement. But that's not the case.
>>
>> The real issue is that swiotlb memory is decrypted. So bouncing
>> through the swiotlb exposes to the host what is supposed to be
>> confidential data passed on the Confidential VMBus. Disabling
>> the swiotlb bouncing in this case is a hard requirement to preserve
>> confidentiality.
>
> Yeah, this really isn't a Hyper-V problem. Indeed as things stand,
> "swiotlb=force" could potentially break confidentiality for any
> environment trying to invent a notion of private DMA, and perhaps we
> could throw a big warning about that, but really the answer there is
> "Don't run your confidential workload with 'swiotlb=force'. Why would
> you even do that? Debug your drivers in a regular VM or bare-metal with
> full debug visibility like a normal person..."
>
> The fact is we do not have a proper notion of trusted/private DMA yet,
> and this is not the way to add it. The current assumption is very much
> that all DMA is untrusted in the CoCo sense, because initially it was
> only virtual devices emulated by a hypervisor, thus had to be bounced
> through shared memory anyway. AMD SEV with a stage 1 IOMMU in the guest
> can allow an assigned physical device to access a suitably-aligned
> encrypted buffer directly, but that's still effectively just putting the
> buffer into a temporarily shared state for that device, it merely skips
> sharing it with the rest of the system. !force_dma_unencrypted() doesn't
> mean "we trust this device's DMA", it just means "we don't have to use
> explicitly-decrypted pages to accommodate untrusted/shared DMA here",
> plus it also serves double-duty for host encryption which doesn't share
> the same trust model anyway.
>
> I assumed this would follow the TDISP stuff, but if Hyper-V has an
> alternative device-trusting mechanism already then there's no need to
> wait. We want some common device property (likely consolidating the
> current PCI external-facing port notion of trustedness plus whatever
> TDISP wants), with which we can then make proper decisions in all the
> right DMA API paths - and if it can end up replacing the horrible
> force_dma_unencrypted() as well then all the better! I'd totally
> forgotten about the previous discussion that Michael referred to (which
> I had to track down[1]), but it looks like all the main points were
> already covered there and we were approaching a consensus, so really I
> guess someone just needs to give it a go.
>
With my device-assignment-related changes, I have made the following
update. Enforcing that a trusted device cannot use SWIOTLB may be a
slightly stronger requirement than needed, but it simplifies the
overall design.
I also have a prototype that adds two default swiotlb instances, i.e.,
static struct io_tlb_mem io_tlb_default_mem;
static struct io_tlb_mem io_tlb_default_shared_mem;
Looking at that change, I would suggest we avoid doing this unless we
are certain that there is a requirement for a trusted device to use
SWIOTLB bouncing.
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b27de03f2466..07ef149bd9fc 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -292,6 +292,9 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size);
static inline bool is_swiotlb_for_alloc(struct device *dev)
{
+ if (device_cc_accepted(dev))
+ return false;
+
return dev->dma_io_tlb_mem->for_alloc;
}
#else
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 34fe14b987f0..a89a7ac07499 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -159,6 +159,14 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
*/
static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
{
+ /*
+ * Atomic pools are marked decrypted and are used either when the pfn
+ * memory encryption attributes need updating or for DMA non-coherent
+ * device allocations. Neither applies to a trusted device.
+ */
+ if (device_cc_accepted(dev))
+ return false;
+
return !gfpflags_allow_blocking(gfp) && !is_swiotlb_for_alloc(dev);
}
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a862712f4dc6..6d9f0c869c6f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1643,6 +1643,9 @@ bool is_swiotlb_active(struct device *dev)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ if (device_cc_accepted(dev))
+ return false;
+
return mem && mem->nslabs;
}
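For illustration only (this is not part of the patch): a minimal
userspace model of the gating that the three hunks above introduce. It
assumes device_cc_accepted() reduces to a per-device boolean, and every
model_* name is a hypothetical stand-in rather than kernel API. Unlike
the patch, the model also tolerates a missing pool pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_tlb_mem. */
struct model_tlb {
	unsigned long nslabs;	/* zero means no bounce pool is set up */
	bool for_alloc;		/* pool also serves coherent allocations */
};

/* Hypothetical stand-in for struct device. */
struct model_dev {
	bool cc_accepted;	/* assumed result of device_cc_accepted() */
	struct model_tlb *tlb;	/* may be NULL in this model */
};

/* Models is_swiotlb_for_alloc(): a trusted device never allocates from swiotlb. */
static bool model_swiotlb_for_alloc(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->cc_accepted)
		return false;
	return dev->tlb && dev->tlb->for_alloc;
}

/* Models dma_direct_use_pool(): can_block stands in for gfpflags_allow_blocking(). */
static bool model_use_atomic_pool(const struct model_dev *dev, bool can_block)
{
	assert(dev != NULL);
	if (dev->cc_accepted)
		return false;
	return !can_block && !model_swiotlb_for_alloc(dev);
}

/* Models is_swiotlb_active(): a trusted device never sees an active bounce pool. */
static bool model_swiotlb_active(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->cc_accepted)
		return false;
	return dev->tlb && dev->tlb->nslabs;
}
```

With cc_accepted set, all three predicates return false regardless of
the pool state, which is the "trusted device never bounces" rule
described above.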
* Re: [PATCH rdma-next 42/50] RDMA/bnxt_re: Complete CQ resize in a single step
From: Leon Romanovsky @ 2026-02-16 8:07 UTC
To: Selvin Xavier
Cc: Jason Gunthorpe, Kalesh AP, Potnuri Bharat Teja, Michael Margolin,
Gal Pressman, Yossi Leybovich, Cheng Xu, Kai Shen,
Chengchang Tang, Junxian Huang, Abhijit Gangurde, Allen Hubbe,
Krzysztof Czurylo, Tatyana Nikolova, Long Li, Konstantin Taranov,
Yishai Hadas, Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun,
linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <CA+sbYW0Gh2bLoPZKzH9u-EcWDTz6mbF3RB=6Q3q=m7YpUpNU6Q@mail.gmail.com>
On Mon, Feb 16, 2026 at 09:29:29AM +0530, Selvin Xavier wrote:
> On Fri, Feb 13, 2026 at 4:31 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > There is no need to defer the CQ resize operation, as it is intended to
> > be completed in one pass. The current bnxt_re_resize_cq() implementation
> > does not handle concurrent CQ resize requests, and this will be addressed
> > in the following patches.
> bnxt HW requires that the previous CQ memory be available with the HW until
> HW generates a cut off cqe on the CQ that is being destroyed. This is
> the reason for
> polling the completions in the user library after returning the
> resize_cq call. Once the polling
> thread sees the expected CQE, it will invoke the driver to free CQ
> memory.
This flow is problematic. It requires the kernel to trust a user‑space
application, which is not acceptable. There is no guarantee that the
rdma-core implementation is correct or will invoke the interface properly.
Users can bypass rdma-core entirely and issue ioctls directly (syzkaller,
custom rdma-core variants, etc.), leading to umem leaks, races that overwrite
kernel memory, and access to fields that are now being modified. All of this
can occur silently and without any protections.
> So ib_umem_release should wait. This patch doesn't guarantee that.
The issue is that it was never guaranteed in the first place. It only appeared
to work under very controlled conditions.
> Do you think if there is a better way to handle this requirement?
You should wait for BNXT_RE_WC_TYPE_COFF in the kernel before returning
from resize_cq.
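To make the suggestion concrete, here is a rough userspace sketch of
"wait for the cut-off CQE, then release the old memory" with a bounded
number of polls. It is a model only: the model_* names, the countdown
that stands in for hardware progress, and the -ETIMEDOUT policy are all
assumptions, not bnxt_re code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a CQ in the middle of a resize. */
struct model_cq {
	int polls_until_cutoff;	/* stand-in for hardware progress */
	bool old_mem_released;	/* set once the old buffer is freed */
};

/* Returns true once the (modelled) cut-off CQE has been observed. */
static bool model_cutoff_seen(struct model_cq *cq)
{
	if (cq->polls_until_cutoff > 0) {
		cq->polls_until_cutoff--;
		return false;
	}
	return true;
}

/*
 * Poll up to max_polls times. The old memory is released only after
 * the cut-off CQE was seen; on timeout it stays mapped and the caller
 * gets -ETIMEDOUT.
 */
static int model_resize_finish(struct model_cq *cq, int max_polls)
{
	assert(cq != NULL);
	for (int i = 0; i < max_polls; i++) {
		if (model_cutoff_seen(cq)) {
			cq->old_mem_released = true;
			return 0;
		}
	}
	return -ETIMEDOUT;
}
```

The point of the shape is that the release decision is made entirely
in the kernel path; what the driver should do on timeout is a separate
policy question that this sketch does not try to answer.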
Thanks
>
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > drivers/infiniband/hw/bnxt_re/ib_verbs.c | 33 +++++++++-----------------------
> > 1 file changed, 9 insertions(+), 24 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > index d652018c19b3..2aecfbbb7eaf 100644
> > --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > @@ -3309,20 +3309,6 @@ int bnxt_re_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
> > return rc;
> > }
> >
> > -static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
> > -{
> > - struct bnxt_re_dev *rdev = cq->rdev;
> > -
> > - bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
> > -
> > - cq->qplib_cq.max_wqe = cq->resize_cqe;
> > - if (cq->resize_umem) {
> > - ib_umem_release(cq->ib_cq.umem);
> > - cq->ib_cq.umem = cq->resize_umem;
> > - cq->resize_umem = NULL;
> > - cq->resize_cqe = 0;
> > - }
> > -}
> >
> > int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
> > struct ib_udata *udata)
> > @@ -3387,7 +3373,15 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
> > goto fail;
> > }
> >
> > - cq->ib_cq.cqe = cq->resize_cqe;
> > + bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
> > +
> > + cq->qplib_cq.max_wqe = cq->resize_cqe;
> > + ib_umem_release(cq->ib_cq.umem);
> > + cq->ib_cq.umem = cq->resize_umem;
> > + cq->resize_umem = NULL;
> > + cq->resize_cqe = 0;
> > +
> > + cq->ib_cq.cqe = entries;
> > atomic_inc(&rdev->stats.res.resize_count);
> >
> > return 0;
> > @@ -3907,15 +3901,6 @@ int bnxt_re_poll_cq(struct ib_cq *ib_cq, int num_entries, struct ib_wc *wc)
> > struct bnxt_re_sqp_entries *sqp_entry = NULL;
> > unsigned long flags;
> >
> > - /* User CQ; the only processing we do is to
> > - * complete any pending CQ resize operation.
> > - */
> > - if (cq->ib_cq.umem) {
> > - if (cq->resize_umem)
> > - bnxt_re_resize_cq_complete(cq);
> > - return 0;
> > - }
> > -
> > spin_lock_irqsave(&cq->cq_lock, flags);
> > budget = min_t(u32, num_entries, cq->max_cql);
> > num_entries = budget;
> >
> > --
> > 2.52.0
> >
* Re: [PATCH rdma-next 42/50] RDMA/bnxt_re: Complete CQ resize in a single step
From: Selvin Xavier @ 2026-02-16 3:59 UTC
To: Leon Romanovsky
Cc: Jason Gunthorpe, Kalesh AP, Potnuri Bharat Teja, Michael Margolin,
Gal Pressman, Yossi Leybovich, Cheng Xu, Kai Shen,
Chengchang Tang, Junxian Huang, Abhijit Gangurde, Allen Hubbe,
Krzysztof Czurylo, Tatyana Nikolova, Long Li, Konstantin Taranov,
Yishai Hadas, Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun,
linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-42-f3be85847922@nvidia.com>
On Fri, Feb 13, 2026 at 4:31 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> There is no need to defer the CQ resize operation, as it is intended to
> be completed in one pass. The current bnxt_re_resize_cq() implementation
> does not handle concurrent CQ resize requests, and this will be addressed
> in the following patches.
bnxt HW requires that the previous CQ memory be available with the HW until
HW generates a cut off cqe on the CQ that is being destroyed. This is
the reason for
polling the completions in the user library after returning the
resize_cq call. Once the polling
thread sees the expected CQE, it will invoke the driver to free CQ
memory. So ib_umem_release
should wait. This patch doesn't guarantee that. Do you think if there
is a better way to handle this requirement?
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> drivers/infiniband/hw/bnxt_re/ib_verbs.c | 33 +++++++++-----------------------
> 1 file changed, 9 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> index d652018c19b3..2aecfbbb7eaf 100644
> --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> @@ -3309,20 +3309,6 @@ int bnxt_re_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
> return rc;
> }
>
> -static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
> -{
> - struct bnxt_re_dev *rdev = cq->rdev;
> -
> - bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
> -
> - cq->qplib_cq.max_wqe = cq->resize_cqe;
> - if (cq->resize_umem) {
> - ib_umem_release(cq->ib_cq.umem);
> - cq->ib_cq.umem = cq->resize_umem;
> - cq->resize_umem = NULL;
> - cq->resize_cqe = 0;
> - }
> -}
>
> int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
> struct ib_udata *udata)
> @@ -3387,7 +3373,15 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
> goto fail;
> }
>
> - cq->ib_cq.cqe = cq->resize_cqe;
> + bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
> +
> + cq->qplib_cq.max_wqe = cq->resize_cqe;
> + ib_umem_release(cq->ib_cq.umem);
> + cq->ib_cq.umem = cq->resize_umem;
> + cq->resize_umem = NULL;
> + cq->resize_cqe = 0;
> +
> + cq->ib_cq.cqe = entries;
> atomic_inc(&rdev->stats.res.resize_count);
>
> return 0;
> @@ -3907,15 +3901,6 @@ int bnxt_re_poll_cq(struct ib_cq *ib_cq, int num_entries, struct ib_wc *wc)
> struct bnxt_re_sqp_entries *sqp_entry = NULL;
> unsigned long flags;
>
> - /* User CQ; the only processing we do is to
> - * complete any pending CQ resize operation.
> - */
> - if (cq->ib_cq.umem) {
> - if (cq->resize_umem)
> - bnxt_re_resize_cq_complete(cq);
> - return 0;
> - }
> -
> spin_lock_irqsave(&cq->cq_lock, flags);
> budget = min_t(u32, num_entries, cq->max_cql);
> num_entries = budget;
>
> --
> 2.52.0
>
* Re: [PATCH rdma-next 29/50] RDMA/rxe: Split user and kernel CQ creation paths
From: Leon Romanovsky @ 2026-02-15 7:06 UTC
To: yanjun.zhu
Cc: Jason Gunthorpe, Selvin Xavier, Kalesh AP, Potnuri Bharat Teja,
Michael Margolin, Gal Pressman, Yossi Leybovich, Cheng Xu,
Kai Shen, Chengchang Tang, Junxian Huang, Abhijit Gangurde,
Allen Hubbe, Krzysztof Czurylo, Tatyana Nikolova, Long Li,
Konstantin Taranov, Yishai Hadas, Michal Kalderon, Bryan Tan,
Vishnu Dasa, Broadcom internal kernel review list,
Christian Benvenuti, Nelson Escobar, Dennis Dalessandro,
Bernard Metzler, Zhu Yanjun, linux-kernel, linux-rdma,
linux-hyperv
In-Reply-To: <459a01fe-8f23-4114-a127-98ec95c53464@linux.dev>
On Fri, Feb 13, 2026 at 03:22:13PM -0800, yanjun.zhu wrote:
> On 2/13/26 2:58 AM, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Separate the CQ creation logic into distinct kernel and user flows.
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > drivers/infiniband/sw/rxe/rxe_verbs.c | 81 ++++++++++++++++++++---------------
> > 1 file changed, 47 insertions(+), 34 deletions(-)
<...>
> > + if (err)
> > + return err;
> > err = rxe_cq_from_init(rxe, cq, attr->cqe, attr->comp_vector, udata,
> > uresp);
>
> Neither rxe_create_user_cq() nor rxe_create_cq() explicitly validates
> attr->comp_vector. Is this guaranteed to be validated by the core before
> reaching the driver, or should rxe still enforce device-specific limits?
We should validate it at the IB/core level.
https://github.com/linux-rdma/rdma-core/blob/8b9cdb7c6bd2b6e4e64e08888c10124b0d1873f2/libibverbs/man/ibv_create_cq.3#L32
comp_vector
for signaling completion events; it must be at least zero and less than
context->num_comp_vectors.
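A minimal sketch of that range check, written as a standalone helper.
The function name is hypothetical (it is not an existing IB/core
symbol), and the signed int types follow the man page wording rather
than any particular kernel structure.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: accept comp_vector only if it is at least zero
 * and less than num_comp_vectors, as the man page requires.
 */
static int model_check_comp_vector(int comp_vector, int num_comp_vectors)
{
	if (comp_vector < 0 || comp_vector >= num_comp_vectors)
		return -EINVAL;
	return 0;
}
```

Doing this once in the core would mean rxe and the other drivers can
assume a valid vector by the time their create_cq path runs.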
>
> > - if (err) {
> > - rxe_dbg_cq(cq, "create cq failed, err = %d\n", err);
> > + if (err)
> > goto err_cleanup;
>
> The err_cleanup label is only used for this specific error path. It may
> improve readability to inline the cleanup logic at this site and remove the
> label altogether.
I'll delete it. Thanks
* Re: [PATCH v2] x86: mshyperv: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-02-14 14:23 UTC
To: Saurabh Singh Sengar
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley
In-Reply-To: <aY/5eBLvodLYbV0a@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On 14.02.26 05:26, Saurabh Singh Sengar wrote:
> On Fri, Feb 13, 2026 at 08:20:31PM -0800, Saurabh Singh Sengar wrote:
>> On Fri, Feb 06, 2026 at 07:47:54AM +0100, Jan Kiszka wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
>>> with related guest support enabled:
>>>
>>> [ 1.127941] hv_vmbus: registering driver hyperv_drm
>>>
>>> [ 1.132518] =============================
>>> [ 1.132519] [ BUG: Invalid wait context ]
>>> [ 1.132521] 6.19.0-rc8+ #9 Not tainted
>>> [ 1.132524] -----------------------------
>>> [ 1.132525] swapper/0/0 is trying to lock:
>>> [ 1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
>>> [ 1.132543] other info that might help us debug this:
>>> [ 1.132544] context-{2:2}
>>> [ 1.132545] 1 lock held by swapper/0/0:
>>> [ 1.132547] #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
>>> [ 1.132557] stack backtrace:
>>> [ 1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
>>> [ 1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
>>> [ 1.132567] Call Trace:
>>> [ 1.132570] <IRQ>
>>> [ 1.132573] dump_stack_lvl+0x6e/0xa0
>>> [ 1.132581] __lock_acquire+0xee0/0x21b0
>>> [ 1.132592] lock_acquire+0xd5/0x2d0
>>> [ 1.132598] ? vmbus_chan_sched+0xc4/0x2b0
>>> [ 1.132606] ? lock_acquire+0xd5/0x2d0
>>> [ 1.132613] ? vmbus_chan_sched+0x31/0x2b0
>>> [ 1.132619] rt_spin_lock+0x3f/0x1f0
>>> [ 1.132623] ? vmbus_chan_sched+0xc4/0x2b0
>>> [ 1.132629] ? vmbus_chan_sched+0x31/0x2b0
>>> [ 1.132634] vmbus_chan_sched+0xc4/0x2b0
>>> [ 1.132641] vmbus_isr+0x2c/0x150
>>> [ 1.132648] __sysvec_hyperv_callback+0x5f/0xa0
>>> [ 1.132654] sysvec_hyperv_callback+0x88/0xb0
>>> [ 1.132658] </IRQ>
>>> [ 1.132659] <TASK>
>>> [ 1.132660] asm_sysvec_hyperv_callback+0x1a/0x20
>>>
>>> As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT,
>>> the complete vmbus_handler execution needs to be moved into thread
>>> context. Open-coding this allows to skip the IPI that irq_work would
>>> additionally bring and which we do not need, being an IRQ, never an NMI.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> First I would like to share my opinion that, although support for the
>> RT kernel is not on the near-term roadmap, we should welcome RT Linux
>> patches.
>>
>> Coming back to this patch I can reproduce the stack trace referenced
>> in the commit when running with PREEMPT_RT enabled, and I have verified
>> that this patch resolves the issue. Next, I observed the storage-related
>> stack trace mentioned in Jan’s other patch; applying the storvsc patch
>> fixed that as well.
>>
>> However, when testing without PREEMPT_RT enabled, I see another lockdep
>> warning below (both with and without Jan’s patches). I wanted to check
>> whether it is possible to address this issue as part of the same fix.
>> Doing so would make the change more useful beyond PREEMPT_RT.
>
>
> Sharing the stack, missed in previous email:
>
> [ 1.612866] =============================
> [ 1.616197] tun: Universal TUN/TAP device driver, 1.6
> [ 1.612866] [ BUG: Invalid wait context ]
> [ 1.612866] 6.19.0-next-20260212+ #8 Not tainted
> [ 1.612866] -----------------------------
> [ 1.612866] swapper/0/0 is trying to lock:
> [ 1.612866] ffff895a03dfd3f0 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xda/0x350
> [ 1.621086] PPP generic driver version 2.4.2
> [ 1.612866] other info that might help us debug this:
> [ 1.612866] context-{2:2}
> [ 1.612866] 1 lock held by swapper/0/0:
> [ 1.612866] #0: ffffffff9b7d4660 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x38/0x350
The current context is LD_WAIT_SPIN (2), but sched_lock is
LD_WAIT_CONFIG (3), which is invalid on RT. So lockdep is warning about
exactly what this patch addresses. But my patch likely lacks a
lockdep_hardirq_threaded() call before vmbus_handler is invoked to
annotate that fact.
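As a rough illustration of the rule behind the "{3:3}" versus
"context-{2:2}" numbers, here is a toy model of the wait-type
comparison. The enum mirrors the ordering of enum lockdep_wait_type
with CONFIG_PROVE_RAW_LOCK_NESTING enabled; the helper name is
hypothetical and the real check in lockdep is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Same ordering as enum lockdep_wait_type (PROVE_RAW_LOCK_NESTING=y). */
enum model_wait_type {
	MODEL_WAIT_INV = 0,	/* not checked */
	MODEL_WAIT_FREE,	/* wait-free, e.g. RCU read side */
	MODEL_WAIT_SPIN,	/* raw_spinlock_t; plain hard IRQ context */
	MODEL_WAIT_CONFIG,	/* spinlock_t; sleeps on PREEMPT_RT */
	MODEL_WAIT_SLEEP,	/* mutexes and other sleeping locks */
};

/*
 * Hypothetical helper: a lock is acceptable only if its wait type does
 * not exceed what the current context tolerates. Unchecked types pass.
 */
static bool model_wait_context_ok(enum model_wait_type ctx,
				  enum model_wait_type lock)
{
	if (ctx == MODEL_WAIT_INV || lock == MODEL_WAIT_INV)
		return true;
	return lock <= ctx;
}
```

In this model a hard IRQ annotated as threaded would be treated as
MODEL_WAIT_CONFIG context, so taking a spinlock_t there would pass,
which is the effect the missing annotation is meant to have.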
Jan
> [ 1.628045] i8042: PNP: No PS/2 controller found.
> [ 1.612866] stack backtrace:
> [ 1.612866] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-next-20260212+ #8 PREEMPT
> [ 1.612866] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> [ 1.612866] Call Trace:
> [ 1.612866] <IRQ>
> [ 1.612866] dump_stack_lvl+0x6f/0xb0
> [ 1.612866] dump_stack+0x10/0x20
> [ 1.612866] __lock_acquire+0x973/0x24e0
> [ 1.612866] ? __lock_acquire+0x449/0x24e0
> [ 1.612866] lock_acquire+0xb6/0x2c0
> [ 1.612866] ? vmbus_chan_sched+0xda/0x350
> [ 1.612866] ? vmbus_chan_sched+0x38/0x350
> [ 1.612866] _raw_spin_lock+0x2f/0x50
> [ 1.612866] ? vmbus_chan_sched+0xda/0x350
> [ 1.612866] vmbus_chan_sched+0xda/0x350
> [ 1.612866] ? sched_clock_idle_wakeup_event+0x22/0x50
> [ 1.612866] vmbus_isr+0x26/0x170
> [ 1.612866] __sysvec_hyperv_callback+0x43/0x80
> [ 1.612866] sysvec_hyperv_callback+0x85/0xb0
> [ 1.612866] </IRQ>
> [ 1.612866] <TASK>
> [ 1.612866] asm_sysvec_hyperv_callback+0x1b/0x20
> [ 1.612866] RIP: 0010:pv_native_safe_halt+0xb/0x10
> [ 1.612866] Code: 48 2b 05 d8 00 97 00 5d c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 0f 00 2d 99 33 1f 00 fb f4 <e9> 40 9e 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55
> [ 1.612866] RSP: 0000:ffffffff9b603dd8 EFLAGS: 00000242
> [ 1.612866] RAX: 0000000000040d85 RBX: 0000000000000000 RCX: 0000000000000000
> [ 1.612866] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9981172f
> [ 1.612866] RBP: ffffffff9b603de0 R08: 0000000000000001 R09: 0000000000000000
> [ 1.612866] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> [ 1.612866] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff9b628490
> [ 1.612866] ? do_idle+0x20f/0x290
> [ 1.612866] ? default_idle+0x9/0x20
> [ 1.612866] arch_cpu_idle+0x9/0x10
> [ 1.612866] default_idle_call+0x83/0x210
> [ 1.612866] do_idle+0x20f/0x290
> [ 1.612866] cpu_startup_entry+0x29/0x30
> [ 1.612866] rest_init+0x104/0x200
> [ 1.612866] start_kernel+0xa13/0xc90
> [ 1.612866] ? sme_unmap_bootdata+0x14/0x70
> [ 1.612866] x86_64_start_reservations+0x18/0x30
> [ 1.612866] x86_64_start_kernel+0x148/0x1a0
> [ 1.612866] ? soft_restart_cpu+0x14/0x14
> [ 1.612866] common_startup_64+0x13e/0x141
> [ 1.612866] </TASK>
>
> - Saurabh
--
Siemens AG, Foundational Technologies
Linux Expert Center
* Re: [PATCH v2] x86: mshyperv: Use kthread for vmbus interrupts on PREEMPT_RT
From: Saurabh Singh Sengar @ 2026-02-14 4:26 UTC
To: Jan Kiszka
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley
In-Reply-To: <aY/4D/JVu7TjNOku@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Fri, Feb 13, 2026 at 08:20:31PM -0800, Saurabh Singh Sengar wrote:
> On Fri, Feb 06, 2026 at 07:47:54AM +0100, Jan Kiszka wrote:
> > From: Jan Kiszka <jan.kiszka@siemens.com>
> >
> > Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
> > with related guest support enabled:
> >
> > [ 1.127941] hv_vmbus: registering driver hyperv_drm
> >
> > [ 1.132518] =============================
> > [ 1.132519] [ BUG: Invalid wait context ]
> > [ 1.132521] 6.19.0-rc8+ #9 Not tainted
> > [ 1.132524] -----------------------------
> > [ 1.132525] swapper/0/0 is trying to lock:
> > [ 1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
> > [ 1.132543] other info that might help us debug this:
> > [ 1.132544] context-{2:2}
> > [ 1.132545] 1 lock held by swapper/0/0:
> > [ 1.132547] #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
> > [ 1.132557] stack backtrace:
> > [ 1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
> > [ 1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > [ 1.132567] Call Trace:
> > [ 1.132570] <IRQ>
> > [ 1.132573] dump_stack_lvl+0x6e/0xa0
> > [ 1.132581] __lock_acquire+0xee0/0x21b0
> > [ 1.132592] lock_acquire+0xd5/0x2d0
> > [ 1.132598] ? vmbus_chan_sched+0xc4/0x2b0
> > [ 1.132606] ? lock_acquire+0xd5/0x2d0
> > [ 1.132613] ? vmbus_chan_sched+0x31/0x2b0
> > [ 1.132619] rt_spin_lock+0x3f/0x1f0
> > [ 1.132623] ? vmbus_chan_sched+0xc4/0x2b0
> > [ 1.132629] ? vmbus_chan_sched+0x31/0x2b0
> > [ 1.132634] vmbus_chan_sched+0xc4/0x2b0
> > [ 1.132641] vmbus_isr+0x2c/0x150
> > [ 1.132648] __sysvec_hyperv_callback+0x5f/0xa0
> > [ 1.132654] sysvec_hyperv_callback+0x88/0xb0
> > [ 1.132658] </IRQ>
> > [ 1.132659] <TASK>
> > [ 1.132660] asm_sysvec_hyperv_callback+0x1a/0x20
> >
> > As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT,
> > the complete vmbus_handler execution needs to be moved into thread
> > context. Open-coding this allows to skip the IPI that irq_work would
> > additionally bring and which we do not need, being an IRQ, never an NMI.
> >
> > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>
> First I would like to share my opinion that, although support for the
> RT kernel is not on the near-term roadmap, we should welcome RT Linux
> patches.
>
> Coming back to this patch I can reproduce the stack trace referenced
> in the commit when running with PREEMPT_RT enabled, and I have verified
> that this patch resolves the issue. Next, I observed the storage-related
> stack trace mentioned in Jan’s other patch; applying the storvsc patch
> fixed that as well.
>
> However, when testing without PREEMPT_RT enabled, I see another lockdep
> warning below (both with and without Jan’s patches). I wanted to check
> whether it is possible to address this issue as part of the same fix.
> Doing so would make the change more useful beyond PREEMPT_RT.
Sharing the stack, missed in previous email:
[ 1.612866] =============================
[ 1.616197] tun: Universal TUN/TAP device driver, 1.6
[ 1.612866] [ BUG: Invalid wait context ]
[ 1.612866] 6.19.0-next-20260212+ #8 Not tainted
[ 1.612866] -----------------------------
[ 1.612866] swapper/0/0 is trying to lock:
[ 1.612866] ffff895a03dfd3f0 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xda/0x350
[ 1.621086] PPP generic driver version 2.4.2
[ 1.612866] other info that might help us debug this:
[ 1.612866] context-{2:2}
[ 1.612866] 1 lock held by swapper/0/0:
[ 1.612866] #0: ffffffff9b7d4660 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x38/0x350
[ 1.628045] i8042: PNP: No PS/2 controller found.
[ 1.612866] stack backtrace:
[ 1.612866] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-next-20260212+ #8 PREEMPT
[ 1.612866] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
[ 1.612866] Call Trace:
[ 1.612866] <IRQ>
[ 1.612866] dump_stack_lvl+0x6f/0xb0
[ 1.612866] dump_stack+0x10/0x20
[ 1.612866] __lock_acquire+0x973/0x24e0
[ 1.612866] ? __lock_acquire+0x449/0x24e0
[ 1.612866] lock_acquire+0xb6/0x2c0
[ 1.612866] ? vmbus_chan_sched+0xda/0x350
[ 1.612866] ? vmbus_chan_sched+0x38/0x350
[ 1.612866] _raw_spin_lock+0x2f/0x50
[ 1.612866] ? vmbus_chan_sched+0xda/0x350
[ 1.612866] vmbus_chan_sched+0xda/0x350
[ 1.612866] ? sched_clock_idle_wakeup_event+0x22/0x50
[ 1.612866] vmbus_isr+0x26/0x170
[ 1.612866] __sysvec_hyperv_callback+0x43/0x80
[ 1.612866] sysvec_hyperv_callback+0x85/0xb0
[ 1.612866] </IRQ>
[ 1.612866] <TASK>
[ 1.612866] asm_sysvec_hyperv_callback+0x1b/0x20
[ 1.612866] RIP: 0010:pv_native_safe_halt+0xb/0x10
[ 1.612866] Code: 48 2b 05 d8 00 97 00 5d c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 0f 00 2d 99 33 1f 00 fb f4 <e9> 40 9e 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55
[ 1.612866] RSP: 0000:ffffffff9b603dd8 EFLAGS: 00000242
[ 1.612866] RAX: 0000000000040d85 RBX: 0000000000000000 RCX: 0000000000000000
[ 1.612866] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9981172f
[ 1.612866] RBP: ffffffff9b603de0 R08: 0000000000000001 R09: 0000000000000000
[ 1.612866] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 1.612866] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff9b628490
[ 1.612866] ? do_idle+0x20f/0x290
[ 1.612866] ? default_idle+0x9/0x20
[ 1.612866] arch_cpu_idle+0x9/0x10
[ 1.612866] default_idle_call+0x83/0x210
[ 1.612866] do_idle+0x20f/0x290
[ 1.612866] cpu_startup_entry+0x29/0x30
[ 1.612866] rest_init+0x104/0x200
[ 1.612866] start_kernel+0xa13/0xc90
[ 1.612866] ? sme_unmap_bootdata+0x14/0x70
[ 1.612866] x86_64_start_reservations+0x18/0x30
[ 1.612866] x86_64_start_kernel+0x148/0x1a0
[ 1.612866] ? soft_restart_cpu+0x14/0x14
[ 1.612866] common_startup_64+0x13e/0x141
[ 1.612866] </TASK>
- Saurabh
* Re: [PATCH v2] x86: mshyperv: Use kthread for vmbus interrupts on PREEMPT_RT
From: Saurabh Singh Sengar @ 2026-02-14 4:20 UTC
To: Jan Kiszka
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley
In-Reply-To: <514e068c-1b85-4e39-8388-c1d2b106b4e9@siemens.com>
On Fri, Feb 06, 2026 at 07:47:54AM +0100, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
> with related guest support enabled:
>
> [ 1.127941] hv_vmbus: registering driver hyperv_drm
>
> [ 1.132518] =============================
> [ 1.132519] [ BUG: Invalid wait context ]
> [ 1.132521] 6.19.0-rc8+ #9 Not tainted
> [ 1.132524] -----------------------------
> [ 1.132525] swapper/0/0 is trying to lock:
> [ 1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
> [ 1.132543] other info that might help us debug this:
> [ 1.132544] context-{2:2}
> [ 1.132545] 1 lock held by swapper/0/0:
> [ 1.132547] #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
> [ 1.132557] stack backtrace:
> [ 1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
> [ 1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> [ 1.132567] Call Trace:
> [ 1.132570] <IRQ>
> [ 1.132573] dump_stack_lvl+0x6e/0xa0
> [ 1.132581] __lock_acquire+0xee0/0x21b0
> [ 1.132592] lock_acquire+0xd5/0x2d0
> [ 1.132598] ? vmbus_chan_sched+0xc4/0x2b0
> [ 1.132606] ? lock_acquire+0xd5/0x2d0
> [ 1.132613] ? vmbus_chan_sched+0x31/0x2b0
> [ 1.132619] rt_spin_lock+0x3f/0x1f0
> [ 1.132623] ? vmbus_chan_sched+0xc4/0x2b0
> [ 1.132629] ? vmbus_chan_sched+0x31/0x2b0
> [ 1.132634] vmbus_chan_sched+0xc4/0x2b0
> [ 1.132641] vmbus_isr+0x2c/0x150
> [ 1.132648] __sysvec_hyperv_callback+0x5f/0xa0
> [ 1.132654] sysvec_hyperv_callback+0x88/0xb0
> [ 1.132658] </IRQ>
> [ 1.132659] <TASK>
> [ 1.132660] asm_sysvec_hyperv_callback+0x1a/0x20
>
> As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT,
> the complete vmbus_handler execution needs to be moved into thread
> context. Open-coding this allows to skip the IPI that irq_work would
> additionally bring and which we do not need, being an IRQ, never an NMI.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
First, I would like to share my opinion that, although support for the
RT kernel is not on the near-term roadmap, we should welcome RT Linux
patches.
Coming back to this patch: I can reproduce the stack trace referenced
in the commit message when running with PREEMPT_RT enabled, and I have
verified that this patch resolves the issue. Next, I observed the
storage-related stack trace mentioned in Jan’s other patch; applying the
storvsc patch fixed that as well.
However, when testing without PREEMPT_RT enabled, I see another lockdep
warning below (both with and without Jan’s patches). I wanted to check
whether it is possible to address this issue as part of the same fix.
Doing so would make the change more useful beyond PREEMPT_RT.
> ---
>
> Changes in v2:
> - reorder vmbus_irq_pending clearing to fix a race condition
>
> arch/x86/kernel/cpu/mshyperv.c | 52 ++++++++++++++++++++++++++++++++--
> 1 file changed, 50 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 579fb2c64cfd..b39cb983326a 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -17,6 +17,7 @@
> #include <linux/irq.h>
> #include <linux/kexec.h>
> #include <linux/random.h>
> +#include <linux/smpboot.h>
> #include <asm/processor.h>
> #include <asm/hypervisor.h>
> #include <hyperv/hvhdk.h>
> @@ -150,6 +151,43 @@ static void (*hv_stimer0_handler)(void);
> static void (*hv_kexec_handler)(void);
> static void (*hv_crash_handler)(struct pt_regs *regs);
>
> +static DEFINE_PER_CPU(bool, vmbus_irq_pending);
> +static DEFINE_PER_CPU(struct task_struct *, vmbus_irqd);
> +
> +static void vmbus_irqd_wake(void)
> +{
> + struct task_struct *tsk = __this_cpu_read(vmbus_irqd);
> +
> + __this_cpu_write(vmbus_irq_pending, true);
> + wake_up_process(tsk);
> +}
> +
> +static void vmbus_irqd_setup(unsigned int cpu)
> +{
> + sched_set_fifo(current);
> +}
> +
> +static int vmbus_irqd_should_run(unsigned int cpu)
> +{
> + return __this_cpu_read(vmbus_irq_pending);
> +}
> +
> +static void run_vmbus_irqd(unsigned int cpu)
> +{
> + __this_cpu_write(vmbus_irq_pending, false);
> + vmbus_handler();
> +}
> +
> +static bool vmbus_irq_initialized;
> +
> +static struct smp_hotplug_thread vmbus_irq_threads = {
> + .store = &vmbus_irqd,
> + .setup = vmbus_irqd_setup,
> + .thread_should_run = vmbus_irqd_should_run,
> + .thread_fn = run_vmbus_irqd,
> + .thread_comm = "vmbus_irq/%u",
> +};
> +
> DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
> {
> struct pt_regs *old_regs = set_irq_regs(regs);
> @@ -158,8 +196,12 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
> if (mshv_handler)
> mshv_handler();
>
> - if (vmbus_handler)
> - vmbus_handler();
> + if (vmbus_handler) {
> + if (IS_ENABLED(CONFIG_PREEMPT_RT))
> + vmbus_irqd_wake();
> + else
> + vmbus_handler();
> + }
>
> if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
> apic_eoi();
> @@ -174,6 +216,10 @@ void hv_setup_mshv_handler(void (*handler)(void))
>
> void hv_setup_vmbus_handler(void (*handler)(void))
> {
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
> + BUG_ON(smpboot_register_percpu_thread(&vmbus_irq_threads));
> + vmbus_irq_initialized = true;
> + }
> vmbus_handler = handler;
> }
>
> @@ -181,6 +227,8 @@ void hv_remove_vmbus_handler(void)
> {
> /* We have no way to deallocate the interrupt gate */
> vmbus_handler = NULL;
> + smpboot_unregister_percpu_thread(&vmbus_irq_threads);
Should this call be guarded so that it runs only when
vmbus_irq_initialized is true?
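To make the question concrete, here is a minimal userspace sketch of such a
guard. It is not a tested kernel change: every model_* name is hypothetical,
the two counting helpers merely stand in for
smpboot_register_percpu_thread() and smpboot_unregister_percpu_thread(),
and rt_enabled stands in for IS_ENABLED(CONFIG_PREEMPT_RT).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the smpboot register/unregister calls. */
static int model_register_calls;
static int model_unregister_calls;

static int model_register_threads(void)
{
	model_register_calls++;
	return 0;
}

static void model_unregister_threads(void)
{
	model_unregister_calls++;
}

/* Plays the role of vmbus_irq_initialized in the patch. */
static bool model_irq_initialized;
static void (*model_handler)(void);

/* rt_enabled stands in for IS_ENABLED(CONFIG_PREEMPT_RT). */
int model_setup_handler(void (*handler)(void), bool rt_enabled)
{
	if (rt_enabled && !model_irq_initialized) {
		/* The patch uses BUG_ON() here; the model returns the error. */
		int err = model_register_threads();

		if (err)
			return err;
		model_irq_initialized = true;
	}
	model_handler = handler;
	return 0;
}

void model_remove_handler(void)
{
	model_handler = NULL;
	/* Only undo the registration if setup actually performed it. */
	if (model_irq_initialized) {
		model_unregister_threads();
		model_irq_initialized = false;
	}
}

/* Invoke the installed handler, if any; returns whether one ran. */
bool model_dispatch(void)
{
	if (!model_handler)
		return false;
	model_handler();
	return true;
}
```

With this shape, a remove without a prior RT setup (or a second remove in a
row) never reaches the unregister call.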
- Saurabh
> + vmbus_irq_initialized = false;
> }
>
> /*
> --
> 2.51.0
* [PATCH AUTOSEL 6.19-5.10] hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
From: Sasha Levin @ 2026-02-14 0:59 UTC (permalink / raw)
To: patches, stable
Cc: Thomas Weißschuh, kernel test robot, Nathan Chancellor,
Arnd Bergmann, Wei Liu (Microsoft), Nicolas Schier,
Greg Kroah-Hartman, Sasha Levin, kys, haiyangz, decui, longli,
linux-hyperv, llvm
In-Reply-To: <20260214010245.3671907-1-sashal@kernel.org>
From: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
[ Upstream commit 1e5271393d777f6159d896943b4c44c4f3ecff52 ]
The unpacked union within a packed struct generates alignment warnings
on clang for 32-bit ARM:
./usr/include/linux/hyperv.h:361:2: error: field within 'struct hv_kvp_exchg_msg_value'
is less aligned than 'union hv_kvp_exchg_msg_value::(anonymous at ./usr/include/linux/hyperv.h:361:2)'
and is usually due to 'struct hv_kvp_exchg_msg_value' being packed,
which can lead to unaligned accesses [-Werror,-Wunaligned-access]
361 | union {
| ^
With the recent changes to compile-test the UAPI headers in more cases,
this warning in combination with CONFIG_WERROR breaks the build.
Fix the warning.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512140314.DzDxpIVn-lkp@intel.com/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/linux-kbuild/20260110-uapi-test-disable-headers-arm-clang-unaligned-access-v1-1-b7b0fa541daa@kernel.org/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-kbuild/29b2e736-d462-45b7-a0a9-85f8d8a3de56@app.fastmail.com/
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Wei Liu (Microsoft) <wei.liu@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260115-kbuild-alignment-vbox-v1-1-076aed1623ff@linutronix.de
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Analysis
### Commit Message Analysis
This commit fixes a build error caused by an unpacked union within a
packed struct in a UAPI header (`include/uapi/linux/hyperv.h`). The
issue manifests as a `-Werror,-Wunaligned-access` error on clang for
32-bit ARM, which **breaks the build** when `CONFIG_WERROR` is enabled.
Key signals:
- **Two separate "Reported-by:" tags** — kernel test robot and Nathan
Chancellor (a prominent kernel build/clang developer)
- **Multiple "Closes:" links** to actual build failure reports
- **Tested-by** and **Reviewed-by** from Nicolas Schier
- **Acked-by** from subsystem maintainer (Wei Liu) and Greg Kroah-
Hartman himself
- Commit message explicitly says "breaks the build"
### Code Change Analysis
The change is a single-line modification:
```c
- };
+ } __attribute__((packed));
```
This adds the `packed` attribute to an anonymous union inside the
already-packed struct `hv_kvp_exchg_msg_value`. The outer struct is
already `__attribute__((packed))`, so adding `packed` to the inner union
aligns it with the containing struct's packing requirement, silencing
the clang warning.
**Functional impact**: This union contains `__u8 value[...]`, `__u32
value_u32`, and `__u64 value_u64`. Since the union is inside a packed
struct, the compiler should already be treating accesses as potentially
unaligned. Adding `packed` to the union itself makes this explicit and
resolves the inconsistency that triggers the warning. There is **no
change to the actual memory layout** — the struct was already packed,
and the union within it was already at whatever offset the packing
dictated. This just makes the annotation consistent.
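The layout claim can be checked with a small compile-time sketch. The two
structs below are hypothetical models, not copies of the UAPI header: the
leading fields and the `MODEL_*` sizes (512 and 2048) are assumptions chosen
to resemble `hv_kvp_exchg_msg_value`, and the checks assume a GCC- or
clang-compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_KEY_SIZE   512	/* assumed key buffer size */
#define MODEL_VALUE_SIZE 2048	/* assumed value buffer size */

/* "Before": outer struct packed, inner union left unpacked. */
struct model_before {
	uint32_t value_type;
	uint32_t key_size;
	uint32_t value_size;
	uint8_t key[MODEL_KEY_SIZE];
	union {
		uint8_t value[MODEL_VALUE_SIZE];
		uint32_t value_u32;
		uint64_t value_u64;
	};
} __attribute__((packed));

/* "After": the inner union carries the packed attribute as well. */
struct model_after {
	uint32_t value_type;
	uint32_t key_size;
	uint32_t value_size;
	uint8_t key[MODEL_KEY_SIZE];
	union {
		uint8_t value[MODEL_VALUE_SIZE];
		uint32_t value_u32;
		uint64_t value_u64;
	} __attribute__((packed));
} __attribute__((packed));

/* The union starts at the same byte offset in both variants. */
_Static_assert(offsetof(struct model_before, value_u64) ==
	       offsetof(struct model_after, value_u64),
	       "union offset differs");

/* The overall size is unchanged as well. */
_Static_assert(sizeof(struct model_before) == sizeof(struct model_after),
	       "struct size differs");
```

In this model the union sits at offset 524, which is not a multiple of 8, so
the `__u64` member really is misaligned. The sizes match only because the
byte array's length is a multiple of the widest member's alignment; with an
odd array length, packing the union would drop its tail padding and shrink
the struct.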
### Classification
This is a **build fix** — one of the explicitly allowed categories for
stable backporting. It prevents compilation failure on a specific (and
common) configuration: clang + 32-bit ARM + CONFIG_WERROR.
### Scope and Risk Assessment
- **Lines changed**: 1 (literally changing `};` to `}
__attribute__((packed));`)
- **Files changed**: 1 UAPI header
- **Risk**: Extremely low. The packed attribute on the inner union is
semantically correct (the outer struct is already packed), and this
doesn't change the ABI or memory layout
- **Subsystem**: Hyper-V UAPI header, but the fix is really about build
correctness
### User Impact
- **Who is affected**: Anyone building the kernel with clang on 32-bit
ARM (or potentially other architectures in the future) with
`CONFIG_WERROR=y`
- **Severity**: Build breakage — users literally cannot compile the
kernel in this configuration
- **Frequency**: 100% reproducible in the affected configuration
### Stability Indicators
- Acked by Greg Kroah-Hartman (stable tree maintainer)
- Acked by Wei Liu (Hyper-V maintainer)
- Tested and reviewed by Nicolas Schier
- The fix is trivially correct — adding packed to a union inside a
packed struct
### Dependency Check
This commit is self-contained. It references "recent changes to compile-
test the UAPI headers in more cases" as the trigger that exposed this
warning, but the fix itself (adding packed to the union) is valid
regardless of whether those compile-test changes are present. The
underlying warning condition exists in any version of this header
compiled with clang on ARM.
However, I should check if the struct in question exists in older stable
trees.
The struct `hv_kvp_exchg_msg_value` with this union has been in
`include/uapi/linux/hyperv.h` for a very long time (it's part of the
Hyper-V KVP userspace interface). The fix would apply cleanly to any
stable tree that has this header.
### Conclusion
This is a textbook stable backport candidate:
- **Fixes a real build breakage** (not just a warning — it errors out
with CONFIG_WERROR)
- **Trivially small and obviously correct** — one attribute addition
- **Zero risk of regression** — no behavioral change, no ABI change
- **Well-reviewed** — acked by GKH, subsystem maintainer, tested and
reviewed
- **Multiple reporters** — real-world problem encountered by kernel test
infrastructure and developers
- Build fixes are explicitly listed as appropriate stable material
**YES**
include/uapi/linux/hyperv.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index aaa502a7bff46..1749b35ab2c21 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -362,7 +362,7 @@ struct hv_kvp_exchg_msg_value {
__u8 value[HV_KVP_EXCHANGE_MAX_VALUE_SIZE];
__u32 value_u32;
__u64 value_u64;
- };
+ } __attribute__((packed));
} __attribute__((packed));
struct hv_kvp_msg_enumerate {
--
2.51.0
* Re: [PATCH rdma-next 29/50] RDMA/rxe: Split user and kernel CQ creation paths
From: yanjun.zhu @ 2026-02-13 23:22 UTC (permalink / raw)
To: Leon Romanovsky, Jason Gunthorpe, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-29-f3be85847922@nvidia.com>
On 2/13/26 2:58 AM, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Separate the CQ creation logic into distinct kernel and user flows.
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> drivers/infiniband/sw/rxe/rxe_verbs.c | 81 ++++++++++++++++++++---------------
> 1 file changed, 47 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 38d8c408320f..1e651bdd8622 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -1072,58 +1072,70 @@ static int rxe_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
> }
>
> /* cq */
> -static int rxe_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
> - struct uverbs_attr_bundle *attrs)
> +static int rxe_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
> + struct uverbs_attr_bundle *attrs)
> {
> struct ib_udata *udata = &attrs->driver_udata;
> struct ib_device *dev = ibcq->device;
> struct rxe_dev *rxe = to_rdev(dev);
> struct rxe_cq *cq = to_rcq(ibcq);
> - struct rxe_create_cq_resp __user *uresp = NULL;
> - int err, cleanup_err;
> + struct rxe_create_cq_resp __user *uresp;
> + int err;
>
> - if (udata) {
> - if (udata->outlen < sizeof(*uresp)) {
> - err = -EINVAL;
> - rxe_dbg_dev(rxe, "malformed udata, err = %d\n", err);
> - goto err_out;
> - }
> - uresp = udata->outbuf;
> - }
> + if (udata->outlen < sizeof(*uresp))
> + return -EINVAL;
>
> - if (attr->flags) {
> - err = -EOPNOTSUPP;
> - rxe_dbg_dev(rxe, "bad attr->flags, err = %d\n", err);
> - goto err_out;
> - }
> + uresp = udata->outbuf;
>
> - err = rxe_cq_chk_attr(rxe, NULL, attr->cqe, attr->comp_vector);
> - if (err) {
> - rxe_dbg_dev(rxe, "bad init attributes, err = %d\n", err);
> - goto err_out;
> - }
> + if (attr->flags || ibcq->umem)
> + return -EOPNOTSUPP;
> +
> + if (attr->cqe > rxe->attr.max_cqe)
> + return -EINVAL;
>
> err = rxe_add_to_pool(&rxe->cq_pool, cq);
> - if (err) {
> - rxe_dbg_dev(rxe, "unable to create cq, err = %d\n", err);
> - goto err_out;
> - }
> + if (err)
> + return err;
>
> err = rxe_cq_from_init(rxe, cq, attr->cqe, attr->comp_vector, udata,
> uresp);
Neither rxe_create_user_cq() nor rxe_create_cq() explicitly validates
attr->comp_vector. Is this guaranteed to be validated by the core before
reaching the driver, or should rxe still enforce device-specific limits?
> - if (err) {
> - rxe_dbg_cq(cq, "create cq failed, err = %d\n", err);
> + if (err)
> goto err_cleanup;
The err_cleanup label is only used for this specific error path. It may
improve readability to inline the cleanup logic at this site and remove
the label altogether.
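A rough userspace sketch of the label-free shape, to illustrate the
suggestion only. All model_* names are hypothetical stand-ins (for
rxe_add_to_pool(), rxe_cq_from_init() and rxe_cleanup()), and the fail_init
field exists purely to simulate an init failure.

```c
#include <errno.h>

/* Hypothetical model of a CQ object; fail_init injects an init error. */
struct model_cq {
	int in_pool;
	int fail_init;
};

static int model_add_to_pool(struct model_cq *cq)
{
	cq->in_pool = 1;
	return 0;
}

static void model_cleanup(struct model_cq *cq)
{
	cq->in_pool = 0;
}

static int model_cq_init(struct model_cq *cq)
{
	return cq->fail_init ? -ENOMEM : 0;
}

int model_create_cq(struct model_cq *cq, int cqe, int max_cqe)
{
	int err;

	if (cqe > max_cqe)
		return -EINVAL;

	err = model_add_to_pool(cq);
	if (err)
		return err;

	err = model_cq_init(cq);
	if (err) {
		/* Only failure site after pool insertion: undo it here. */
		model_cleanup(cq);
		return err;
	}

	return 0;
}
```

The trade-off is the usual one: with a single failure site the inline form
is shorter, while a label scales better if more failure points are added
later.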
> - }
>
> return 0;
>
> err_cleanup:
> - cleanup_err = rxe_cleanup(cq);
> - if (cleanup_err)
> - rxe_err_cq(cq, "cleanup failed, err = %d\n", cleanup_err);
> -err_out:
> - rxe_err_dev(rxe, "returned err = %d\n", err);
> + rxe_cleanup(cq);
> + return err;
> +}
> +
> +static int rxe_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
> + struct uverbs_attr_bundle *attrs)
> +{
> + struct ib_device *dev = ibcq->device;
> + struct rxe_dev *rxe = to_rdev(dev);
> + struct rxe_cq *cq = to_rcq(ibcq);
> + int err;
> +
> + if (attr->flags)
> + return -EOPNOTSUPP;
> +
> + if (attr->cqe > rxe->attr.max_cqe)
> + return -EINVAL;
> +
> + err = rxe_add_to_pool(&rxe->cq_pool, cq);
> + if (err)
> + return err;
> +
> + err = rxe_cq_from_init(rxe, cq, attr->cqe, attr->comp_vector, NULL,
> + NULL);
> + if (err)
> + goto err_cleanup;
ditto
Thanks a lot.
Zhu Yanjun
> +
> + return 0;
> +
> +err_cleanup:
> + rxe_cleanup(cq);
> return err;
> }
>
> @@ -1478,6 +1490,7 @@ static const struct ib_device_ops rxe_dev_ops = {
> .attach_mcast = rxe_attach_mcast,
> .create_ah = rxe_create_ah,
> .create_cq = rxe_create_cq,
> + .create_user_cq = rxe_create_user_cq,
> .create_qp = rxe_create_qp,
> .create_srq = rxe_create_srq,
> .create_user_ah = rxe_create_ah,
>
* RE: [PATCH] x86: mshyperv: Use kthread for vmbus interrupts on PREEMPT_RT
From: mhklkml @ 2026-02-13 21:35 UTC (permalink / raw)
To: 'Jan Kiszka', 'Michael Kelley',
'Florian Bezdeka', 'K. Y. Srinivasan',
'Haiyang Zhang', 'Wei Liu', 'Dexuan Cui',
'Long Li', 'Thomas Gleixner',
'Ingo Molnar', 'Borislav Petkov',
'Dave Hansen', x86
Cc: linux-hyperv, linux-kernel, 'RT', 'Mitchell Levy',
skinsburskii, mrathor, anirudh, schakrabarti, ssengar
In-Reply-To: <b084a7b6-c394-4337-82cd-8b9cb911d8d5@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com> Sent: Thursday, February 12, 2026 8:06 AM
>
> On 09.02.26 19:25, Michael Kelley wrote:
> > From: Florian Bezdeka <florian.bezdeka@siemens.com> Sent: Monday, February 9, 2026 2:35 AM
> >>
> >> On Sat, 2026-02-07 at 01:30 +0000, Michael Kelley wrote:
> >>
> >> [snip]
> >>>
> >>> I've run your suggested experiment on an arm64 VM in the Azure cloud. My
> >>> kernel was linux-next 20260128. I set CONFIG_PREEMPT_RT=y and
> >>> CONFIG_PROVE_LOCKING=y, but did not add either of your two patches
> >>> (neither the storvsc driver patch nor the x86 VMBus interrupt handling patch).
> >>> The VM comes up and runs, but with this warning during boot:
> >>>
> >>> [ 3.075604] hv_utils: Registering HyperV Utility Driver
> >>> [ 3.075636] hv_vmbus: registering driver hv_utils
> >>> [ 3.085920] =============================
> >>> [ 3.088128] hv_vmbus: registering driver hv_netvsc
> >>> [ 3.091180] [ BUG: Invalid wait context ]
> >>> [ 3.093544] 6.19.0-rc7-next-20260128+ #3 Tainted: G E
> >>> [ 3.097582] -----------------------------
> >>> [ 3.099899] systemd-udevd/284 is trying to lock:
> >>> [ 3.102568] ffff000100e24490 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0x128/0x3b8 [hv_vmbus]
> >>> [ 3.108208] other info that might help us debug this:
> >>> [ 3.111454] context-{2:2}
> >>> [ 3.112987] 1 lock held by systemd-udevd/284:
> >>> [ 3.115626] #0: ffffd5cfc20bcc80 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0xcc/0x3b8 [hv_vmbus]
> >>> [ 3.121224] stack backtrace:
> >>> [ 3.122897] CPU: 0 UID: 0 PID: 284 Comm: systemd-udevd Tainted: G E 6.19.0-rc7-next-20260128+ #3 PREEMPT_RT
> >>> [ 3.129631] Tainted: [E]=UNSIGNED_MODULE
> >>> [ 3.131946] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 06/10/2025
> >>> [ 3.138553] Call trace:
> >>> [ 3.140015] show_stack+0x20/0x38 (C)
> >>> [ 3.142137] dump_stack_lvl+0x9c/0x158
> >>> [ 3.144340] dump_stack+0x18/0x28
> >>> [ 3.146290] __lock_acquire+0x488/0x1e20
> >>> [ 3.148569] lock_acquire+0x11c/0x388
> >>> [ 3.150703] rt_spin_lock+0x54/0x230
> >>> [ 3.152785] vmbus_chan_sched+0x128/0x3b8 [hv_vmbus]
> >>> [ 3.155611] vmbus_isr+0x34/0x80 [hv_vmbus]
> >>> [ 3.158093] vmbus_percpu_isr+0x18/0x30 [hv_vmbus]
> >>> [ 3.160848] handle_percpu_devid_irq+0xdc/0x348
> >>> [ 3.163495] handle_irq_desc+0x48/0x68
> >>> [ 3.165851] generic_handle_domain_irq+0x20/0x38
> >>> [ 3.168664] gic_handle_irq+0x1dc/0x430
> >>> [ 3.170868] call_on_irq_stack+0x30/0x70
> >>> [ 3.173161] do_interrupt_handler+0x88/0xa0
> >>> [ 3.175724] el1_interrupt+0x4c/0xb0
> >>> [ 3.177855] el1h_64_irq_handler+0x18/0x28
> >>> [ 3.180332] el1h_64_irq+0x84/0x88
> >>> [ 3.182378] _raw_spin_unlock_irqrestore+0x4c/0xb0 (P)
> >>> [ 3.185493] rt_mutex_slowunlock+0x404/0x440
> >>> [ 3.187951] rt_spin_unlock+0xb8/0x178
> >>> [ 3.190394] kmem_cache_alloc_noprof+0xf0/0x4f8
> >>> [ 3.193100] alloc_empty_file+0x64/0x148
> >>> [ 3.195461] path_openat+0x58/0xaa0
> >>> [ 3.197658] do_file_open+0xa0/0x140
> >>> [ 3.199752] do_sys_openat2+0x190/0x278
> >>> [ 3.202124] do_sys_open+0x60/0xb8
> >>> [ 3.204047] __arm64_sys_openat+0x2c/0x48
> >>> [ 3.206433] invoke_syscall+0x6c/0xf8
> >>> [ 3.208519] el0_svc_common.constprop.0+0x48/0xf0
> >>> [ 3.211050] do_el0_svc+0x24/0x38
> >>> [ 3.212990] el0_svc+0x164/0x3c8
> >>> [ 3.214842] el0t_64_sync_handler+0xd0/0xe8
> >>> [ 3.217251] el0t_64_sync+0x1b0/0x1b8
> >>> [ 3.219450] hv_utils: Heartbeat IC version 3.0
> >>> [ 3.219471] hv_utils: Shutdown IC version 3.2
> >>> [ 3.219844] hv_utils: TimeSync IC version 4.0
> >>
> >> That matches with my expectation that the same problem exists on arm64.
> >> The patch from Jan addresses that issue for x86 (only, so far) as we do
> >> not have a working test environment for arm64 yet.
> >
> > OK. I had understood Jan's earlier comments to mean that the VMBus
> > interrupt problem was implicitly solved on arm64 because of VMBus using
> > a standard Linux IRQ on arm64. But evidently that's not the case. So my
> > earlier comment stands: The code changes should go into the architecture
> > independent portion of the VMBus driver, and not under arch/x86. I
> > can probably work with you to test on arm64 if need be.
> >
>
> I can move the code, sure, but I still haven't understood what
> invalidates my assumptions (beside what you observed). vmbus_drv calls
> request_percpu_irq, and that is - as far as I can see - not injecting
> IRQF_NO_THREAD. Any explanations welcome.
I haven't set up detailed debugging on arm64 yet, but in prep for that
I went looking at the places in the kernel IRQ handling where
IRQF_NO_THREAD influences behavior. The key function appears to be
irq_setup_forced_threading(). This function first checks force_irqthreads(),
which will be "true" when PREEMPT_RT is set. The function then checks
the IRQF_NO_THREAD flag and the IRQF_PERCPU flag. From what I can
see, the IRQF_PERCPU flag is treated like the IRQF_NO_THREAD flag, and
causes forced threading to *not* be done. So the behavior ends up being
the same as when PREEMPT_RT is not set.
Since the VMBus interrupt is a per-cpu interrupt, forced threading is not
done. In that case, the stack trace I reported makes sense. Take a look at
the code and see if you agree.
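To spell out the decision as I read it, here is a small userspace model.
It is a sketch of my description above, not the kernel function itself:
the MODEL_* flag values are placeholders rather than the real IRQF_* bit
values, and the real irq_setup_forced_threading() does more than this.

```c
#include <stdbool.h>

/* Placeholder bits; names echo IRQF_PERCPU and IRQF_NO_THREAD only. */
#define MODEL_IRQF_PERCPU	(1u << 0)
#define MODEL_IRQF_NO_THREAD	(1u << 1)

/*
 * Returns true if the handler would be moved into a forced IRQ thread.
 * force_irqthreads models the result of force_irqthreads(), which is
 * true when PREEMPT_RT is set.
 */
bool model_gets_forced_thread(bool force_irqthreads, unsigned int flags)
{
	if (!force_irqthreads)
		return false;

	/* Per-CPU interrupts are exempted just like NO_THREAD ones. */
	if (flags & (MODEL_IRQF_NO_THREAD | MODEL_IRQF_PERCPU))
		return false;

	return true;
}
```

In this model, a per-CPU interrupt such as the VMBus one gets the same
answer with or without PREEMPT_RT: no forced thread, so the handler still
runs in hard interrupt context.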
Michael
>
> Reproduction is still not possible for me. I was playing a bit with qemu
> in the hope to make it provide its minimal vmbus support (for
> ballooning), but that was not yet successful on arm64.
>
* Re: [PATCH rdma-next 28/50] RDMA/siw: Split user and kernel CQ creation paths
From: Leon Romanovsky @ 2026-02-13 21:17 UTC (permalink / raw)
To: Bernard Metzler
Cc: Jason Gunthorpe, Selvin Xavier, Kalesh AP, Potnuri Bharat Teja,
Michael Margolin, Gal Pressman, Yossi Leybovich, Cheng Xu,
Kai Shen, Chengchang Tang, Junxian Huang, Abhijit Gangurde,
Allen Hubbe, Krzysztof Czurylo, Tatyana Nikolova, Long Li,
Konstantin Taranov, Yishai Hadas, Michal Kalderon, Bryan Tan,
Vishnu Dasa, Broadcom internal kernel review list,
Christian Benvenuti, Nelson Escobar, Dennis Dalessandro,
Zhu Yanjun, linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <054452b7-7e08-4f8c-8010-e1b69c4b3997@linux.dev>
On Fri, Feb 13, 2026 at 05:56:32PM +0100, Bernard Metzler wrote:
> On 13.02.2026 11:58, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Separate the CQ creation logic into distinct kernel and user flows.
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > drivers/infiniband/sw/siw/siw_main.c | 1 +
> > drivers/infiniband/sw/siw/siw_verbs.c | 111 +++++++++++++++++++++++-----------
> > drivers/infiniband/sw/siw/siw_verbs.h | 2 +
> > 3 files changed, 80 insertions(+), 34 deletions(-)
<...>
> > +int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> > + struct uverbs_attr_bundle *attrs)
> > +{
> > + struct siw_device *sdev = to_siw_dev(base_cq->device);
> > + struct siw_cq *cq = to_siw_cq(base_cq);
> > + int rv, size = attr->cqe;
> > +
> > + if (attr->flags)
> > + return -EOPNOTSUPP;
> > +
> > + if (atomic_inc_return(&sdev->num_cq) > SIW_MAX_CQ) {
> > + siw_dbg(base_cq->device, "too many CQ's\n");
> > + rv = -ENOMEM;
> > + goto err_out;
> > + }
> > + if (size < 1 || size > sdev->attrs.max_cqe) {
>
> isn't there now also a check for zero sized CQ in
> __ib_alloc_cq(), which obsoletes that < 1 check?
Thanks, this line needs to be changed to "if (attr->cqe > sdev->attrs.max_cqe)".
>
> Everything looks right otherwise.
>
> Thanks,
> Bernard.
Thanks
* Re: [PATCH rdma-next 28/50] RDMA/siw: Split user and kernel CQ creation paths
From: Bernard Metzler @ 2026-02-13 16:56 UTC (permalink / raw)
To: Leon Romanovsky, Jason Gunthorpe, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-28-f3be85847922@nvidia.com>
On 13.02.2026 11:58, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Separate the CQ creation logic into distinct kernel and user flows.
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> drivers/infiniband/sw/siw/siw_main.c | 1 +
> drivers/infiniband/sw/siw/siw_verbs.c | 111 +++++++++++++++++++++++-----------
> drivers/infiniband/sw/siw/siw_verbs.h | 2 +
> 3 files changed, 80 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/infiniband/sw/siw/siw_main.c b/drivers/infiniband/sw/siw/siw_main.c
> index 5168307229a9..75dcf3578eac 100644
> --- a/drivers/infiniband/sw/siw/siw_main.c
> +++ b/drivers/infiniband/sw/siw/siw_main.c
> @@ -232,6 +232,7 @@ static const struct ib_device_ops siw_device_ops = {
> .alloc_pd = siw_alloc_pd,
> .alloc_ucontext = siw_alloc_ucontext,
> .create_cq = siw_create_cq,
> + .create_user_cq = siw_create_user_cq,
> .create_qp = siw_create_qp,
> .create_srq = siw_create_srq,
> .dealloc_driver = siw_device_cleanup,
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
> index efa2f097b582..92b25b389b69 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -1139,15 +1139,15 @@ int siw_destroy_cq(struct ib_cq *base_cq, struct ib_udata *udata)
> * @attrs: uverbs bundle
> */
>
> -int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> - struct uverbs_attr_bundle *attrs)
> +int siw_create_user_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> + struct uverbs_attr_bundle *attrs)
> {
> struct ib_udata *udata = &attrs->driver_udata;
> struct siw_device *sdev = to_siw_dev(base_cq->device);
> struct siw_cq *cq = to_siw_cq(base_cq);
> int rv, size = attr->cqe;
>
> - if (attr->flags)
> + if (attr->flags || base_cq->umem)
> return -EOPNOTSUPP;
>
> if (atomic_inc_return(&sdev->num_cq) > SIW_MAX_CQ) {
> @@ -1155,7 +1155,7 @@ int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> rv = -ENOMEM;
> goto err_out;
> }
> - if (size < 1 || size > sdev->attrs.max_cqe) {
> + if (attr->cqe > sdev->attrs.max_cqe) {
> siw_dbg(base_cq->device, "CQ size error: %d\n", size);
> rv = -EINVAL;
> goto err_out;
> @@ -1164,13 +1164,8 @@ int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> cq->base_cq.cqe = size;
> cq->num_cqe = size;
>
> - if (udata)
> - cq->queue = vmalloc_user(size * sizeof(struct siw_cqe) +
> - sizeof(struct siw_cq_ctrl));
> - else
> - cq->queue = vzalloc(size * sizeof(struct siw_cqe) +
> - sizeof(struct siw_cq_ctrl));
> -
> + cq->queue = vmalloc_user(size * sizeof(struct siw_cqe) +
> + sizeof(struct siw_cq_ctrl));
> if (cq->queue == NULL) {
> rv = -ENOMEM;
> goto err_out;
> @@ -1182,33 +1177,32 @@ int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
>
> cq->notify = (struct siw_cq_ctrl *)&cq->queue[size];
>
> - if (udata) {
> - struct siw_uresp_create_cq uresp = {};
> - struct siw_ucontext *ctx =
> - rdma_udata_to_drv_context(udata, struct siw_ucontext,
> - base_ucontext);
> - size_t length = size * sizeof(struct siw_cqe) +
> - sizeof(struct siw_cq_ctrl);
> + struct siw_uresp_create_cq uresp = {};
> + struct siw_ucontext *ctx =
> + rdma_udata_to_drv_context(udata, struct siw_ucontext,
> + base_ucontext);
> + size_t length = size * sizeof(struct siw_cqe) +
> + sizeof(struct siw_cq_ctrl);
>
> - cq->cq_entry =
> - siw_mmap_entry_insert(ctx, cq->queue,
> - length, &uresp.cq_key);
> - if (!cq->cq_entry) {
> - rv = -ENOMEM;
> - goto err_out;
> - }
> + cq->cq_entry =
> + siw_mmap_entry_insert(ctx, cq->queue,
> + length, &uresp.cq_key);
> + if (!cq->cq_entry) {
> + rv = -ENOMEM;
> + goto err_out;
> + }
>
> - uresp.cq_id = cq->id;
> - uresp.num_cqe = size;
> + uresp.cq_id = cq->id;
> + uresp.num_cqe = size;
>
> - if (udata->outlen < sizeof(uresp)) {
> - rv = -EINVAL;
> - goto err_out;
> - }
> - rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> - if (rv)
> - goto err_out;
> + if (udata->outlen < sizeof(uresp)) {
> + rv = -EINVAL;
> + goto err_out;
> }
> + rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> + if (rv)
> + goto err_out;
> +
> return 0;
>
> err_out:
> @@ -1227,6 +1221,55 @@ int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> return rv;
> }
>
> +int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> + struct uverbs_attr_bundle *attrs)
> +{
> + struct siw_device *sdev = to_siw_dev(base_cq->device);
> + struct siw_cq *cq = to_siw_cq(base_cq);
> + int rv, size = attr->cqe;
> +
> + if (attr->flags)
> + return -EOPNOTSUPP;
> +
> + if (atomic_inc_return(&sdev->num_cq) > SIW_MAX_CQ) {
> + siw_dbg(base_cq->device, "too many CQ's\n");
> + rv = -ENOMEM;
> + goto err_out;
> + }
> + if (size < 1 || size > sdev->attrs.max_cqe) {
Isn't there now also a check for a zero-sized CQ in
__ib_alloc_cq(), which makes that < 1 check obsolete?
Everything looks right otherwise.
Thanks,
Bernard.
> + siw_dbg(base_cq->device, "CQ size error: %d\n", size);
> + rv = -EINVAL;
> + goto err_out;
> + }
> + size = roundup_pow_of_two(size);
> + cq->base_cq.cqe = size;
> + cq->num_cqe = size;
> +
> + cq->queue = vzalloc(size * sizeof(struct siw_cqe) +
> + sizeof(struct siw_cq_ctrl));
> + if (cq->queue == NULL) {
> + rv = -ENOMEM;
> + goto err_out;
> + }
> + get_random_bytes(&cq->id, 4);
> + siw_dbg(base_cq->device, "new CQ [%u]\n", cq->id);
> +
> + spin_lock_init(&cq->lock);
> +
> + cq->notify = (struct siw_cq_ctrl *)&cq->queue[size];
> +
> + return 0;
> +
> +err_out:
> + siw_dbg(base_cq->device, "CQ creation failed: %d", rv);
> +
> + if (cq->queue)
> + vfree(cq->queue);
> + atomic_dec(&sdev->num_cq);
> +
> + return rv;
> +}
> +
> /*
> * siw_poll_cq()
> *
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.h b/drivers/infiniband/sw/siw/siw_verbs.h
> index e9f4463aecdc..527c356b55af 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.h
> +++ b/drivers/infiniband/sw/siw/siw_verbs.h
> @@ -44,6 +44,8 @@ int siw_query_device(struct ib_device *base_dev, struct ib_device_attr *attr,
> struct ib_udata *udata);
> int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> struct uverbs_attr_bundle *attrs);
> +int siw_create_user_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
> + struct uverbs_attr_bundle *attrs);
> int siw_query_port(struct ib_device *base_dev, u32 port,
> struct ib_port_attr *attr);
> int siw_query_gid(struct ib_device *base_dev, u32 port, int idx,
>
* [PATCH rdma-next 48/50] RDMA/mlx5: Select resize‑CQ callback based on device capabilities
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Remove the legacy capability check when issuing the resize‑CQ command.
Instead, rely on choosing the correct ops during initialization.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/cq.c | 5 -----
drivers/infiniband/hw/mlx5/main.c | 8 +++++++-
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index f7fb6f4aef7d..88f0f5e2944f 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -1267,11 +1267,6 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
int inlen;
int cqe_size;
- if (!MLX5_CAP_GEN(dev->mdev, cq_resize)) {
- pr_info("Firmware does not support resize CQ\n");
- return -ENOSYS;
- }
-
if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz)))
return -EINVAL;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 0471155eb739..f86721681f5b 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -4496,7 +4496,6 @@ static const struct ib_device_ops mlx5_ib_dev_ops = {
.reg_user_mr_dmabuf = mlx5_ib_reg_user_mr_dmabuf,
.req_notify_cq = mlx5_ib_arm_cq,
.rereg_user_mr = mlx5_ib_rereg_user_mr,
- .resize_user_cq = mlx5_ib_resize_cq,
.ufile_hw_cleanup = mlx5_ib_ufile_hw_cleanup,
INIT_RDMA_OBJ_SIZE(ib_ah, mlx5_ib_ah, ibah),
@@ -4509,6 +4508,10 @@ static const struct ib_device_ops mlx5_ib_dev_ops = {
INIT_RDMA_OBJ_SIZE(ib_ucontext, mlx5_ib_ucontext, ibucontext),
};
+static const struct ib_device_ops mlx5_ib_dev_resize_cq_ops = {
+ .resize_user_cq = mlx5_ib_resize_cq,
+};
+
static const struct ib_device_ops mlx5_ib_dev_ipoib_enhanced_ops = {
.rdma_netdev_get_params = mlx5_ib_rn_get_params,
};
@@ -4635,6 +4638,9 @@ static int mlx5_ib_stage_caps_init(struct mlx5_ib_dev *dev)
ib_set_device_ops(&dev->ib_dev, &mlx5_ib_dev_ops);
+ if (MLX5_CAP_GEN(mdev, cq_resize))
+ ib_set_device_ops(&dev->ib_dev, &mlx5_ib_dev_resize_cq_ops);
+
if (IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS))
dev->ib_dev.driver_def = mlx5_ib_defs;
--
2.52.0
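
Editor's note on patch 48/50: the change moves the capability test from call time to init time, so the callback is simply absent on devices that cannot resize. The sketch below is a minimal userspace model of that idea, not the kernel code. All demo_* names are hypothetical, the merge helper is a simplified stand-in for ib_set_device_ops(), and returning -EOPNOTSUPP for a missing callback is an assumption about how a caller would treat an unset op.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an ops table. */
struct demo_ops {
	int (*create_cq)(unsigned int entries);
	int (*resize_cq)(unsigned int entries);
};

struct demo_dev {
	struct demo_ops ops;
};

/* Merge a table: copy only the callbacks the source actually provides. */
static void demo_set_ops(struct demo_dev *dev, const struct demo_ops *src)
{
	if (src->create_cq)
		dev->ops.create_cq = src->create_cq;
	if (src->resize_cq)
		dev->ops.resize_cq = src->resize_cq;
}

static int demo_create_cq(unsigned int entries) { (void)entries; return 0; }
static int demo_resize_cq(unsigned int entries) { (void)entries; return 0; }

static const struct demo_ops demo_base_ops = { .create_cq = demo_create_cq };
static const struct demo_ops demo_resize_ops = { .resize_cq = demo_resize_cq };

/* Register the optional table only when the capability is present. */
static void demo_init(struct demo_dev *dev, int cap_cq_resize)
{
	assert(dev != NULL);
	dev->ops = (struct demo_ops){ 0 };
	demo_set_ops(dev, &demo_base_ops);
	if (cap_cq_resize)
		demo_set_ops(dev, &demo_resize_ops);
}

/* Caller side: an unset callback is reported as "not supported" (assumed). */
static int demo_resize(struct demo_dev *dev, unsigned int entries)
{
	if (!dev->ops.resize_cq)
		return -EOPNOTSUPP;
	return dev->ops.resize_cq(entries);
}
```

With this layout the driver callback no longer needs its own capability check, because it can only be reached when the optional table was registered.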
* [PATCH rdma-next 50/50] RDMA/mthca: Use generic resize-CQ lock
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Replace the open-coded resize-CQ lock with the standard core
implementation for better consistency and maintainability.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mthca/mthca_cq.c | 1 -
drivers/infiniband/hw/mthca/mthca_provider.c | 20 ++++++--------------
drivers/infiniband/hw/mthca/mthca_provider.h | 1 -
3 files changed, 6 insertions(+), 16 deletions(-)
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 26c3408dcaca..9c15e9b886d1 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -819,7 +819,6 @@ int mthca_init_cq(struct mthca_dev *dev, int nent,
spin_lock_init(&cq->lock);
cq->refcount = 1;
init_waitqueue_head(&cq->wait);
- mutex_init(&cq->mutex);
memset(cq_context, 0, sizeof *cq_context);
cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK |
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 85de004547ab..cb94d73e89d6 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -680,28 +680,20 @@ static int mthca_resize_cq(struct ib_cq *ibcq, unsigned int entries,
if (entries > dev->limits.max_cqes)
return -EINVAL;
- mutex_lock(&cq->mutex);
-
entries = roundup_pow_of_two(entries + 1);
- if (entries == ibcq->cqe + 1) {
- ret = 0;
- goto out;
- }
+ if (entries == ibcq->cqe + 1)
+ return 0;
- if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) {
- ret = -EFAULT;
- goto out;
- }
+ if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd)))
+ return -EFAULT;
lkey = ucmd.lkey;
ret = mthca_RESIZE_CQ(dev, cq->cqn, lkey, ilog2(entries));
if (ret)
- goto out;
+ return ret;
ibcq->cqe = entries - 1;
-out:
- mutex_unlock(&cq->mutex);
- return ret;
+ return 0;
}
static int mthca_destroy_cq(struct ib_cq *cq, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h
index 8a77483bb33c..7797d76fb93d 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.h
+++ b/drivers/infiniband/hw/mthca/mthca_provider.h
@@ -198,7 +198,6 @@ struct mthca_cq {
int arm_sn;
wait_queue_head_t wait;
- struct mutex mutex;
};
struct mthca_srq {
--
2.52.0
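
Editor's note on patch 50/50: once the caller serializes resize requests, the driver callback can drop its private mutex and use plain early returns, which is what the diff above does. The sketch below models that split in userspace with a pthread mutex. It is an assumption about the general shape only: the demo_* names are hypothetical and the actual core locking added earlier in the series is not shown in this part of the thread.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct demo_cq {
	pthread_mutex_t resize_lock;	/* owned by the core, not the driver */
	unsigned int cqe;
	int (*resize)(struct demo_cq *cq, unsigned int entries);
};

/* Driver callback: no locking, every exit is a plain return. */
static int demo_driver_resize(struct demo_cq *cq, unsigned int entries)
{
	if (entries == cq->cqe)
		return 0;
	if (entries > 1024)
		return -EINVAL;

	cq->cqe = entries;
	return 0;
}

/* Core wrapper: validates once and holds the lock around the callback. */
static int demo_core_resize(struct demo_cq *cq, unsigned int entries)
{
	int ret;

	if (!entries)
		return -EINVAL;

	pthread_mutex_lock(&cq->resize_lock);
	ret = cq->resize(cq, entries);
	pthread_mutex_unlock(&cq->resize_lock);

	return ret;
}
```

Because the lock and unlock sit in one place, a new error path in the driver cannot forget to unlock.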
* [PATCH rdma-next 49/50] RDMA/mlx5: Reduce CQ memory footprint
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
There is no need to store a temporary umem pointer in the mlx5 CQ
object. Use an on-stack variable instead.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/cq.c | 64 ++++++++++++------------------------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 -
2 files changed, 21 insertions(+), 44 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 88f0f5e2944f..6d9b62742674 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -1218,44 +1218,13 @@ int mlx5_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
return err;
}
-static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
- int entries, struct ib_udata *udata,
- int *cqe_size)
-{
- struct mlx5_ib_resize_cq ucmd;
- struct ib_umem *umem;
- int err;
-
- err = ib_copy_from_udata(&ucmd, udata, sizeof(ucmd));
- if (err)
- return err;
-
- if (ucmd.reserved0 || ucmd.reserved1)
- return -EINVAL;
-
- /* check multiplication overflow */
- if (ucmd.cqe_size && SIZE_MAX / ucmd.cqe_size <= entries - 1)
- return -EINVAL;
-
- umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
- (size_t)ucmd.cqe_size * entries,
- IB_ACCESS_LOCAL_WRITE);
- if (IS_ERR(umem)) {
- err = PTR_ERR(umem);
- return err;
- }
-
- cq->resize_umem = umem;
- *cqe_size = ucmd.cqe_size;
-
- return 0;
-}
-
int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(ibcq->device);
struct mlx5_ib_cq *cq = to_mcq(ibcq);
+ struct mlx5_ib_resize_cq ucmd;
+ struct ib_umem *umem;
unsigned long page_size;
void *cqc;
u32 *in;
@@ -1264,8 +1233,8 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
__be64 *pas;
unsigned int page_offset_quantized = 0;
unsigned int page_shift;
+ size_t umem_size;
int inlen;
- int cqe_size;
if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz)))
return -EINVAL;
@@ -1277,18 +1246,29 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
if (entries == ibcq->cqe + 1)
return 0;
- err = resize_user(dev, cq, entries, udata, &cqe_size);
+ err = ib_copy_from_udata(&ucmd, udata, sizeof(ucmd));
if (err)
return err;
+ if (ucmd.reserved0 || ucmd.reserved1)
+ return -EINVAL;
+
+ if (check_mul_overflow(ucmd.cqe_size, entries, &umem_size))
+ return -EINVAL;
+
+ umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr, umem_size,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem))
+ return PTR_ERR(umem);
+
page_size = mlx5_umem_find_best_cq_quantized_pgoff(
- cq->resize_umem, cqc, log_page_size, MLX5_ADAPTER_PAGE_SHIFT,
+ umem, cqc, log_page_size, MLX5_ADAPTER_PAGE_SHIFT,
page_offset, 64, &page_offset_quantized);
if (!page_size) {
err = -EINVAL;
goto ex_resize;
}
- npas = ib_umem_num_dma_blocks(cq->resize_umem, page_size);
+ npas = ib_umem_num_dma_blocks(umem, page_size);
page_shift = order_base_2(page_size);
inlen = MLX5_ST_SZ_BYTES(modify_cq_in) +
@@ -1301,7 +1281,7 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
}
pas = (__be64 *)MLX5_ADDR_OF(modify_cq_in, in, pas);
- mlx5_ib_populate_pas(cq->resize_umem, 1UL << page_shift, pas, 0);
+ mlx5_ib_populate_pas(umem, 1UL << page_shift, pas, 0);
MLX5_SET(modify_cq_in, in,
modify_field_select_resize_field_select.resize_field_select.resize_field_select,
@@ -1315,7 +1295,7 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
page_shift - MLX5_ADAPTER_PAGE_SHIFT);
MLX5_SET(cqc, cqc, page_offset, page_offset_quantized);
MLX5_SET(cqc, cqc, cqe_sz,
- cqe_sz_to_mlx_sz(cqe_size,
+ cqe_sz_to_mlx_sz(ucmd.cqe_size,
cq->private_flags &
MLX5_IB_CQ_PR_FLAGS_CQE_128_PAD));
MLX5_SET(cqc, cqc, log_cq_size, ilog2(entries));
@@ -1329,8 +1309,7 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
cq->ibcq.cqe = entries - 1;
ib_umem_release(cq->ibcq.umem);
- cq->ibcq.umem = cq->resize_umem;
- cq->resize_umem = NULL;
+ cq->ibcq.umem = umem;
kvfree(in);
return 0;
@@ -1339,8 +1318,7 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
kvfree(in);
ex_resize:
- ib_umem_release(cq->resize_umem);
- cq->resize_umem = NULL;
+ ib_umem_release(umem);
return err;
}
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7b34f32b5ecb..11e4b2ae0469 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -575,7 +575,6 @@ struct mlx5_ib_cq {
spinlock_t lock;
struct mlx5_ib_cq_buf *resize_buf;
- struct ib_umem *resize_umem;
int cqe_size;
struct list_head list_send_qp;
struct list_head list_recv_qp;
--
2.52.0
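
Editor's note on patch 49/50: the diff replaces a hand-written division test with check_mul_overflow() when computing the umem size. The sketch below shows the same pattern in userspace. It assumes GCC or Clang, whose __builtin_mul_overflow() stands in here for the kernel helper, and demo_buf_size() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute elem_size * count into *out.
 * Returns 0 on success or -EINVAL if the product does not fit in size_t.
 */
static int demo_buf_size(size_t elem_size, size_t count, size_t *out)
{
	size_t bytes;

	/* The builtin returns true when the multiplication overflowed. */
	if (__builtin_mul_overflow(elem_size, count, &bytes))
		return -EINVAL;

	*out = bytes;
	return 0;
}
```

The product and the overflow test come from a single operation, so the size passed on to the allocation is the value that was actually checked.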
* [PATCH rdma-next 44/50] RDMA/bnxt_re: Reduce CQ memory footprint
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
There is no need to store resize_cqe and resize_umem in the CQ object.
Use on-stack variables instead.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 37 +++++++++++---------------------
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 2 --
2 files changed, 13 insertions(+), 26 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index d544a4fb1e96..9a8bdb52097f 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3320,6 +3320,8 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
struct bnxt_re_resize_cq_req req;
struct bnxt_re_dev *rdev;
struct bnxt_re_cq *cq;
+ struct ib_umem *umem;
+
int rc, entries;
cq = container_of(ibcq, struct bnxt_re_cq, ib_cq);
@@ -3336,26 +3338,18 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
entries = dev_attr->max_cq_wqes + 1;
/* uverbs consumer */
- if (ib_copy_from_udata(&req, udata, sizeof(req))) {
- rc = -EFAULT;
- goto fail;
- }
+ if (ib_copy_from_udata(&req, udata, sizeof(req)))
+ return -EFAULT;
- cq->resize_umem = ib_umem_get(&rdev->ibdev, req.cq_va,
- entries * sizeof(struct cq_base),
- IB_ACCESS_LOCAL_WRITE);
- if (IS_ERR(cq->resize_umem)) {
- rc = PTR_ERR(cq->resize_umem);
- ibdev_err(&rdev->ibdev, "%s: ib_umem_get failed! rc = %pe\n",
- __func__, cq->resize_umem);
- cq->resize_umem = NULL;
- return rc;
- }
- cq->resize_cqe = entries;
+ umem = ib_umem_get(&rdev->ibdev, req.cq_va,
+ entries * sizeof(struct cq_base),
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem))
+ return PTR_ERR(umem);
memcpy(&sg_info, &cq->qplib_cq.sg_info, sizeof(sg_info));
orig_dpi = cq->qplib_cq.dpi;
- cq->qplib_cq.sg_info.umem = cq->resize_umem;
+ cq->qplib_cq.sg_info.umem = umem;
cq->qplib_cq.sg_info.pgsize = PAGE_SIZE;
cq->qplib_cq.sg_info.pgshft = PAGE_SHIFT;
cq->qplib_cq.dpi = &uctx->dpi;
@@ -3369,21 +3363,16 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
- cq->qplib_cq.max_wqe = cq->resize_cqe;
+ cq->qplib_cq.max_wqe = entries;
ib_umem_release(cq->ib_cq.umem);
- cq->ib_cq.umem = cq->resize_umem;
- cq->resize_umem = NULL;
- cq->resize_cqe = 0;
-
+ cq->ib_cq.umem = umem;
cq->ib_cq.cqe = entries;
atomic_inc(&rdev->stats.res.resize_count);
return 0;
fail:
- ib_umem_release(cq->resize_umem);
- cq->resize_umem = NULL;
- cq->resize_cqe = 0;
+ ib_umem_release(umem);
memcpy(&cq->qplib_cq.sg_info, &sg_info, sizeof(sg_info));
cq->qplib_cq.dpi = orig_dpi;
return rc;
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index 7890d6ebad90..ee7ccaa2ed4c 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -108,8 +108,6 @@ struct bnxt_re_cq {
struct bnxt_qplib_cqe *cql;
#define MAX_CQL_PER_POLL 1024
u32 max_cql;
- struct ib_umem *resize_umem;
- int resize_cqe;
void *uctx_cq_page;
struct hlist_node hash_entry;
};
--
2.52.0
* [PATCH rdma-next 47/50] RDMA/mlx5: Use generic resize-CQ lock
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Replace the open-coded resize-CQ lock with the standard core
implementation for better consistency and maintainability.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/cq.c | 8 +-------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 ---
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 78c3494517d7..f7fb6f4aef7d 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -972,7 +972,6 @@ int mlx5_ib_create_user_cq(struct ib_cq *ibcq,
return -EINVAL;
cq->ibcq.cqe = entries - 1;
- mutex_init(&cq->resize_mutex);
spin_lock_init(&cq->lock);
if (attr->flags & IB_UVERBS_CQ_FLAGS_TIMESTAMP_COMPLETION)
cq->private_flags |= MLX5_IB_CQ_PR_TIMESTAMP_COMPLETION;
@@ -1057,7 +1056,6 @@ int mlx5_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
return -EINVAL;
cq->ibcq.cqe = entries - 1;
- mutex_init(&cq->resize_mutex);
spin_lock_init(&cq->lock);
INIT_LIST_HEAD(&cq->list_send_qp);
INIT_LIST_HEAD(&cq->list_recv_qp);
@@ -1284,10 +1282,9 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
if (entries == ibcq->cqe + 1)
return 0;
- mutex_lock(&cq->resize_mutex);
err = resize_user(dev, cq, entries, udata, &cqe_size);
if (err)
- goto ex;
+ return err;
page_size = mlx5_umem_find_best_cq_quantized_pgoff(
cq->resize_umem, cqc, log_page_size, MLX5_ADAPTER_PAGE_SHIFT,
@@ -1339,7 +1336,6 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
ib_umem_release(cq->ibcq.umem);
cq->ibcq.umem = cq->resize_umem;
cq->resize_umem = NULL;
- mutex_unlock(&cq->resize_mutex);
kvfree(in);
return 0;
@@ -1350,8 +1346,6 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
ex_resize:
ib_umem_release(cq->resize_umem);
cq->resize_umem = NULL;
-ex:
- mutex_unlock(&cq->resize_mutex);
return err;
}
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index e99a647ed62d..7b34f32b5ecb 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -574,9 +574,6 @@ struct mlx5_ib_cq {
*/
spinlock_t lock;
- /* protect resize cq
- */
- struct mutex resize_mutex;
struct mlx5_ib_cq_buf *resize_buf;
struct ib_umem *resize_umem;
int cqe_size;
--
2.52.0
* [PATCH rdma-next 46/50] RDMA/mlx4: Use on-stack variables instead of storing them in the CQ object
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
The temporary resize umem pointer does not need to persist for the lifetime
of the CQ object. Keep it in an on-stack variable instead and fold
mlx4_alloc_resize_umem() into mlx4_ib_resize_cq().
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx4/cq.c | 81 +++++++++++++-----------------------
drivers/infiniband/hw/mlx4/mlx4_ib.h | 1 -
2 files changed, 28 insertions(+), 54 deletions(-)
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index ffc3902dc329..6e8017ecf137 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -294,15 +294,29 @@ int mlx4_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
return err;
}
-static int mlx4_alloc_resize_umem(struct mlx4_ib_dev *dev, struct mlx4_ib_cq *cq,
- int entries, struct ib_udata *udata)
+int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
+ struct ib_udata *udata)
{
+ struct mlx4_ib_dev *dev = to_mdev(ibcq->device);
+ struct mlx4_ib_cq *cq = to_mcq(ibcq);
struct mlx4_ib_resize_cq ucmd;
int cqe_size = dev->dev->caps.cqe_size;
+ struct ib_umem *umem;
+ struct mlx4_mtt mtt;
int shift;
int n;
int err;
+ if (entries > dev->dev->caps.max_cqes)
+ return -EINVAL;
+
+ entries = roundup_pow_of_two(entries + 1);
+ if (entries == ibcq->cqe + 1)
+ return 0;
+
+ if (entries > dev->dev->caps.max_cqes + 1)
+ return -EINVAL;
+
if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
return -EFAULT;
@@ -310,15 +324,14 @@ static int mlx4_alloc_resize_umem(struct mlx4_ib_dev *dev, struct mlx4_ib_cq *cq
if (!cq->resize_buf)
return -ENOMEM;
- cq->resize_umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
- entries * cqe_size,
- IB_ACCESS_LOCAL_WRITE);
- if (IS_ERR(cq->resize_umem)) {
- err = PTR_ERR(cq->resize_umem);
+ umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
+ entries * cqe_size, IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem)) {
+ err = PTR_ERR(umem);
goto err_buf;
}
- shift = mlx4_ib_umem_calc_optimal_mtt_size(cq->resize_umem, 0, &n);
+ shift = mlx4_ib_umem_calc_optimal_mtt_size(umem, 0, &n);
if (shift < 0) {
err = shift;
goto err_umem;
@@ -328,73 +341,35 @@ static int mlx4_alloc_resize_umem(struct mlx4_ib_dev *dev, struct mlx4_ib_cq *cq
if (err)
goto err_umem;
- err = mlx4_ib_umem_write_mtt(dev, &cq->resize_buf->buf.mtt,
- cq->resize_umem);
+ err = mlx4_ib_umem_write_mtt(dev, &cq->resize_buf->buf.mtt, umem);
if (err)
goto err_mtt;
cq->resize_buf->cqe = entries - 1;
- return 0;
-
-err_mtt:
- mlx4_mtt_cleanup(dev->dev, &cq->resize_buf->buf.mtt);
-
-err_umem:
- ib_umem_release(cq->resize_umem);
- cq->resize_umem = NULL;
-err_buf:
- kfree(cq->resize_buf);
- cq->resize_buf = NULL;
- return err;
-}
-
-int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
- struct ib_udata *udata)
-{
- struct mlx4_ib_dev *dev = to_mdev(ibcq->device);
- struct mlx4_ib_cq *cq = to_mcq(ibcq);
- struct mlx4_mtt mtt;
- int err;
-
- if (entries > dev->dev->caps.max_cqes)
- return -EINVAL;
-
- entries = roundup_pow_of_two(entries + 1);
- if (entries == ibcq->cqe + 1)
- return 0;
-
- if (entries > dev->dev->caps.max_cqes + 1)
- return -EINVAL;
-
- err = mlx4_alloc_resize_umem(dev, cq, entries, udata);
- if (err)
- return err;
mtt = cq->buf.mtt;
err = mlx4_cq_resize(dev->dev, &cq->mcq, entries, &cq->resize_buf->buf.mtt);
if (err)
- goto err_buf;
+ goto err_mtt;
mlx4_mtt_cleanup(dev->dev, &mtt);
cq->buf = cq->resize_buf->buf;
cq->ibcq.cqe = cq->resize_buf->cqe;
ib_umem_release(cq->ibcq.umem);
- cq->ibcq.umem = cq->resize_umem;
+ cq->ibcq.umem = umem;
kfree(cq->resize_buf);
cq->resize_buf = NULL;
- cq->resize_umem = NULL;
return 0;
+err_mtt:
+ mlx4_mtt_cleanup(dev->dev, &cq->resize_buf->buf.mtt);
+err_umem:
+ ib_umem_release(umem);
err_buf:
- mlx4_mtt_cleanup(dev->dev, &cq->resize_buf->buf.mtt);
kfree(cq->resize_buf);
- cq->resize_buf = NULL;
-
- ib_umem_release(cq->resize_umem);
- cq->resize_umem = NULL;
return err;
}
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 2f1043690554..4163a6cb32d0 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -120,7 +120,6 @@ struct mlx4_ib_cq {
struct mlx4_ib_cq_resize *resize_buf;
struct mlx4_db db;
spinlock_t lock;
- struct ib_umem *resize_umem;
/* List of qps that it serves.*/
struct list_head send_qp_list;
struct list_head recv_qp_list;
--
2.52.0
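
Editor's note on patch 46/50: the refactor keeps the new buffer in a local variable until the hardware step succeeds, and only then publishes it into the long-lived object. The sketch below is a minimal userspace model of that ownership hand-over with a single unwind label. All demo_* names and the 64-byte entry size are hypothetical, and plain heap memory stands in for the umem.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_cq {
	void *buf;		/* long-lived: the buffer currently in use */
	unsigned int cqe;
};

/* Hypothetical hardware step that can fail. */
static int demo_hw_resize(unsigned int entries)
{
	return entries > 4096 ? -EINVAL : 0;
}

static int demo_resize(struct demo_cq *cq, unsigned int entries)
{
	void *new_buf;		/* temporary: never stored in the object on failure */
	int err;

	new_buf = calloc(entries, 64);
	if (!new_buf)
		return -ENOMEM;

	err = demo_hw_resize(entries);
	if (err)
		goto err_buf;

	/* Success: release the old buffer and publish the new one. */
	free(cq->buf);
	cq->buf = new_buf;
	cq->cqe = entries;
	return 0;

err_buf:
	free(new_buf);
	return err;
}
```

On failure the object is left exactly as it was, and there is no per-object temporary field that has to be reset to NULL.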
* [PATCH rdma-next 45/50] RDMA/mlx4: Use generic resize-CQ lock
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Replace the open-coded resize-CQ lock with the standard core
implementation for better consistency and maintainability.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx4/cq.c | 9 +--------
drivers/infiniband/hw/mlx4/mlx4_ib.h | 1 -
2 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index f4595afced45..ffc3902dc329 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -163,7 +163,6 @@ int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
entries = roundup_pow_of_two(entries + 1);
cq->ibcq.cqe = entries - 1;
- mutex_init(&cq->resize_mutex);
spin_lock_init(&cq->lock);
INIT_LIST_HEAD(&cq->send_qp_list);
INIT_LIST_HEAD(&cq->recv_qp_list);
@@ -253,7 +252,6 @@ int mlx4_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
entries = roundup_pow_of_two(entries + 1);
cq->ibcq.cqe = entries - 1;
- mutex_init(&cq->resize_mutex);
spin_lock_init(&cq->lock);
INIT_LIST_HEAD(&cq->send_qp_list);
INIT_LIST_HEAD(&cq->recv_qp_list);
@@ -369,12 +367,9 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
if (entries > dev->dev->caps.max_cqes + 1)
return -EINVAL;
- mutex_lock(&cq->resize_mutex);
err = mlx4_alloc_resize_umem(dev, cq, entries, udata);
- if (err) {
- mutex_unlock(&cq->resize_mutex);
+ if (err)
return err;
- }
mtt = cq->buf.mtt;
err = mlx4_cq_resize(dev->dev, &cq->mcq, entries, &cq->resize_buf->buf.mtt);
@@ -390,7 +385,6 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
kfree(cq->resize_buf);
cq->resize_buf = NULL;
cq->resize_umem = NULL;
- mutex_unlock(&cq->resize_mutex);
return 0;
@@ -401,7 +395,6 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
ib_umem_release(cq->resize_umem);
cq->resize_umem = NULL;
- mutex_unlock(&cq->resize_mutex);
return err;
}
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 5a799d6df93e..2f1043690554 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -120,7 +120,6 @@ struct mlx4_ib_cq {
struct mlx4_ib_cq_resize *resize_buf;
struct mlx4_db db;
spinlock_t lock;
- struct mutex resize_mutex;
struct ib_umem *resize_umem;
/* List of qps that it serves.*/
struct list_head send_qp_list;
--
2.52.0
* [PATCH rdma-next 40/50] RDMA: Properly propagate the number of CQEs as unsigned int
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Instead of checking whether the number of CQEs is negative or zero, fix the
.resize_user_cq() declaration to use unsigned int. This better reflects the
expected value range. The sanity check is then handled correctly in
ib_uverbs.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/uverbs_cmd.c | 3 +++
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 8 +++----
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 3 ++-
drivers/infiniband/hw/irdma/verbs.c | 2 +-
drivers/infiniband/hw/mlx4/cq.c | 5 +++--
drivers/infiniband/hw/mlx4/mlx4_ib.h | 3 ++-
drivers/infiniband/hw/mlx5/cq.c | 10 +++------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 ++-
drivers/infiniband/hw/mthca/mthca_provider.c | 5 +++--
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 12 +++++------
drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 3 ++-
drivers/infiniband/sw/rdmavt/cq.c | 10 ++-------
drivers/infiniband/sw/rdmavt/cq.h | 2 +-
drivers/infiniband/sw/rxe/rxe_cq.c | 31 ----------------------------
drivers/infiniband/sw/rxe/rxe_loc.h | 3 ---
drivers/infiniband/sw/rxe/rxe_verbs.c | 9 ++++----
include/rdma/ib_verbs.h | 2 +-
17 files changed, 38 insertions(+), 76 deletions(-)
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 57697738fd25..b4b0c7c92fb1 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1147,6 +1147,9 @@ static int ib_uverbs_resize_cq(struct uverbs_attr_bundle *attrs)
if (ret)
return ret;
+ if (!cmd.cqe)
+ return -EINVAL;
+
cq = uobj_get_obj_read(cq, UVERBS_OBJECT_CQ, cmd.cq_handle, attrs);
if (IS_ERR(cq))
return PTR_ERR(cq);
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 16bb586d68c7..d652018c19b3 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3324,7 +3324,8 @@ static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
}
}
-int bnxt_re_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
+int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
+ struct ib_udata *udata)
{
struct bnxt_qplib_sg_info sg_info = {};
struct bnxt_qplib_dpi *orig_dpi = NULL;
@@ -3346,11 +3347,8 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
}
/* Check the requested cq depth out of supported depth */
- if (cqe < 1 || cqe > dev_attr->max_cq_wqes) {
- ibdev_err(&rdev->ibdev, "Resize CQ %#x failed - out of range cqe %d",
- cq->qplib_cq.id, cqe);
+ if (cqe > dev_attr->max_cq_wqes)
return -EINVAL;
- }
uctx = rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx);
entries = bnxt_re_init_depth(cqe + 1, uctx);
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index cac3e10b73f6..7890d6ebad90 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -249,7 +249,8 @@ int bnxt_re_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
int bnxt_re_create_user_cq(struct ib_cq *ibcq,
const struct ib_cq_init_attr *attr,
struct uverbs_attr_bundle *attrs);
-int bnxt_re_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
+int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
+ struct ib_udata *udata);
int bnxt_re_destroy_cq(struct ib_cq *cq, struct ib_udata *udata);
int bnxt_re_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);
int bnxt_re_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index d5442aebf1ac..f20f53ecd869 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -2012,7 +2012,7 @@ static int irdma_destroy_cq(struct ib_cq *ib_cq, struct ib_udata *udata)
* @entries: desired cq size
* @udata: user data
*/
-static int irdma_resize_cq(struct ib_cq *ibcq, int entries,
+static int irdma_resize_cq(struct ib_cq *ibcq, unsigned int entries,
struct ib_udata *udata)
{
struct irdma_resize_cq_req req = {};
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 05fad06b89c2..f4595afced45 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -351,14 +351,15 @@ static int mlx4_alloc_resize_umem(struct mlx4_ib_dev *dev, struct mlx4_ib_cq *cq
return err;
}
-int mlx4_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
+int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
+ struct ib_udata *udata)
{
struct mlx4_ib_dev *dev = to_mdev(ibcq->device);
struct mlx4_ib_cq *cq = to_mcq(ibcq);
struct mlx4_mtt mtt;
int err;
- if (entries < 1 || entries > dev->dev->caps.max_cqes)
+ if (entries > dev->dev->caps.max_cqes)
return -EINVAL;
entries = roundup_pow_of_two(entries + 1);
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6a7ed5225c7d..5a799d6df93e 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -767,7 +767,8 @@ struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
unsigned int *sg_offset);
int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
-int mlx4_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata);
+int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
+ struct ib_udata *udata);
int mlx4_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
struct uverbs_attr_bundle *attrs);
int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index ce20af01cde0..78c3494517d7 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -1253,7 +1253,8 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
return 0;
}
-int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
+int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
+ struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(ibcq->device);
struct mlx5_ib_cq *cq = to_mcq(ibcq);
@@ -1273,13 +1274,8 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
return -ENOSYS;
}
- if (entries < 1 ||
- entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz))) {
- mlx5_ib_warn(dev, "wrong entries number %d, max %d\n",
- entries,
- 1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz));
+ if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz)))
return -EINVAL;
- }
entries = roundup_pow_of_two(entries + 1);
if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz)) + 1)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 2556e326afde..e99a647ed62d 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1380,7 +1380,8 @@ int mlx5_ib_pre_destroy_cq(struct ib_cq *cq);
void mlx5_ib_post_destroy_cq(struct ib_cq *cq);
int mlx5_ib_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
int mlx5_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
-int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata);
+int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
+ struct ib_udata *udata);
struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc);
struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index fd306a229318..85de004547ab 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -668,7 +668,8 @@ static int mthca_create_cq(struct ib_cq *ibcq,
return 0;
}
-static int mthca_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
+static int mthca_resize_cq(struct ib_cq *ibcq, unsigned int entries,
+ struct ib_udata *udata)
{
struct mthca_dev *dev = to_mdev(ibcq->device);
struct mthca_cq *cq = to_mcq(ibcq);
@@ -676,7 +677,7 @@ static int mthca_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *uda
u32 lkey;
int ret;
- if (entries < 1 || entries > dev->limits.max_cqes)
+ if (entries > dev->limits.max_cqes)
return -EINVAL;
mutex_lock(&cq->mutex);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 034d8b937a77..8445780c398f 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -1035,18 +1035,16 @@ int ocrdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
return 0;
}
-int ocrdma_resize_cq(struct ib_cq *ibcq, int new_cnt,
+int ocrdma_resize_cq(struct ib_cq *ibcq, unsigned int new_cnt,
struct ib_udata *udata)
{
- int status = 0;
struct ocrdma_cq *cq = get_ocrdma_cq(ibcq);
- if (new_cnt < 1 || new_cnt > cq->max_hw_cqe) {
- status = -EINVAL;
- return status;
- }
+ if (new_cnt > cq->max_hw_cqe)
+ return -EINVAL;
+
ibcq->cqe = new_cnt;
- return status;
+ return 0;
}
static void ocrdma_flush_cq(struct ocrdma_cq *cq)
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index 4a572608fd9f..bbc08f88c046 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -74,7 +74,8 @@ int ocrdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
int ocrdma_create_user_cq(struct ib_cq *ibcq,
const struct ib_cq_init_attr *attr,
struct uverbs_attr_bundle *attrs);
-int ocrdma_resize_cq(struct ib_cq *, int cqe, struct ib_udata *);
+int ocrdma_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
+ struct ib_udata *udata);
int ocrdma_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
int ocrdma_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
diff --git a/drivers/infiniband/sw/rdmavt/cq.c b/drivers/infiniband/sw/rdmavt/cq.c
index 1ae5d8c86acb..7be79274bafb 100644
--- a/drivers/infiniband/sw/rdmavt/cq.c
+++ b/drivers/infiniband/sw/rdmavt/cq.c
@@ -393,13 +393,7 @@ int rvt_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags)
return ret;
}
-/*
- * rvt_resize_cq - change the size of the CQ
- * @ibcq: the completion queue
- *
- * Return: 0 for success.
- */
-int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
+int rvt_resize_cq(struct ib_cq *ibcq, unsigned int cqe, struct ib_udata *udata)
{
struct rvt_cq *cq = ibcq_to_rvtcq(ibcq);
u32 head, tail, n;
@@ -410,7 +404,7 @@ int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
struct rvt_cq_wc *old_u_wc = NULL;
__u64 offset = 0;
- if (cqe < 1 || cqe > rdi->dparms.props.max_cqe)
+ if (cqe > rdi->dparms.props.max_cqe)
return -EINVAL;
if (udata->outlen < sizeof(__u64))
diff --git a/drivers/infiniband/sw/rdmavt/cq.h b/drivers/infiniband/sw/rdmavt/cq.h
index 14ee2705c443..3827c0e6a0fb 100644
--- a/drivers/infiniband/sw/rdmavt/cq.h
+++ b/drivers/infiniband/sw/rdmavt/cq.h
@@ -15,7 +15,7 @@ int rvt_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
struct uverbs_attr_bundle *attrs);
int rvt_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
int rvt_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags);
-int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
+int rvt_resize_cq(struct ib_cq *ibcq, unsigned int cqe, struct ib_udata *udata);
int rvt_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
int rvt_driver_cq_init(void);
void rvt_cq_exit(void);
diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index fffd144d509e..eaf7802a5cbe 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -8,37 +8,6 @@
#include "rxe_loc.h"
#include "rxe_queue.h"
-int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
- int cqe, int comp_vector)
-{
- int count;
-
- if (cqe <= 0) {
- rxe_dbg_dev(rxe, "cqe(%d) <= 0\n", cqe);
- goto err1;
- }
-
- if (cqe > rxe->attr.max_cqe) {
- rxe_dbg_dev(rxe, "cqe(%d) > max_cqe(%d)\n",
- cqe, rxe->attr.max_cqe);
- goto err1;
- }
-
- if (cq) {
- count = queue_count(cq->queue, QUEUE_TYPE_TO_CLIENT);
- if (cqe < count) {
- rxe_dbg_cq(cq, "cqe(%d) < current # elements in queue (%d)\n",
- cqe, count);
- goto err1;
- }
- }
-
- return 0;
-
-err1:
- return -EINVAL;
-}
-
int rxe_cq_from_init(struct rxe_dev *rxe, struct rxe_cq *cq, int cqe,
int comp_vector, struct ib_udata *udata,
struct rxe_create_cq_resp __user *uresp)
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 7992290886e1..e095c12699cb 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -18,9 +18,6 @@ void rxe_av_fill_ip_info(struct rxe_av *av, struct rdma_ah_attr *attr);
struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt, struct rxe_ah **ahp);
/* rxe_cq.c */
-int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
- int cqe, int comp_vector);
-
int rxe_cq_from_init(struct rxe_dev *rxe, struct rxe_cq *cq, int cqe,
int comp_vector, struct ib_udata *udata,
struct rxe_create_cq_resp __user *uresp);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index bc7c77ff3d90..f57b4ba22a4f 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1139,7 +1139,8 @@ static int rxe_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
return err;
}
-static int rxe_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
+static int rxe_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
+ struct ib_udata *udata)
{
struct rxe_cq *cq = to_rcq(ibcq);
struct rxe_dev *rxe = to_rdev(ibcq->device);
@@ -1150,9 +1151,9 @@ static int rxe_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
return -EINVAL;
uresp = udata->outbuf;
- err = rxe_cq_chk_attr(rxe, cq, cqe, 0);
- if (err)
- return err;
+ if (cqe > rxe->attr.max_cqe ||
+ cqe < queue_count(cq->queue, QUEUE_TYPE_TO_CLIENT))
+ return -EINVAL;
err = rxe_cq_resize_queue(cq, cqe, uresp, udata);
if (err)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 94bb3cc4c67a..7d32d02c35e3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2534,7 +2534,7 @@ struct ib_device_ops {
struct uverbs_attr_bundle *attrs);
int (*modify_cq)(struct ib_cq *cq, u16 cq_count, u16 cq_period);
int (*destroy_cq)(struct ib_cq *cq, struct ib_udata *udata);
- int (*resize_user_cq)(struct ib_cq *cq, int cqe,
+ int (*resize_user_cq)(struct ib_cq *cq, unsigned int cqe,
struct ib_udata *udata);
/*
* pre_destroy_cq - Prevent a cq from generating any new work
--
2.52.0
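
For readers following the int to unsigned int conversion above, here is a small userspace sketch of why the `< 1` half of each range check can go once the count is unsigned. A negative value passed by a caller converts to a very large unsigned value and is caught by the upper bound alone. The names `demo_check_signed`, `demo_check_unsigned` and `DEMO_MAX_CQES` are hypothetical and only stand in for the per-driver limits seen in the diff. Zero passes the unsigned check in this sketch; whether zero is rejected elsewhere is not visible in this part of the series, so treat that as an open assumption.

```c
#include <errno.h>

/* Hypothetical stand-in for a device limit such as caps.max_cqes. */
#define DEMO_MAX_CQES 4096u

/* Old style: a signed count needs a lower and an upper bound check. */
static int demo_check_signed(int entries)
{
	if (entries < 1 || entries > (int)DEMO_MAX_CQES)
		return -EINVAL;
	return 0;
}

/*
 * New style: with an unsigned count, a caller-supplied negative value
 * has already wrapped to a huge number, so the upper bound rejects it.
 */
static int demo_check_unsigned(unsigned int entries)
{
	if (entries > DEMO_MAX_CQES)
		return -EINVAL;
	return 0;
}
```

Calling `demo_check_unsigned((unsigned int)-1)` fails with -EINVAL just as `demo_check_signed(-1)` does, which is the property the simplified checks rely on.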
* [PATCH rdma-next 43/50] RDMA/bnxt_re: Rely on common resize-CQ locking
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
After introducing a shared mutex to protect against concurrent
resize-CQ operations, update the bnxt_re driver to use this mechanism.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 2aecfbbb7eaf..d544a4fb1e96 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3326,12 +3326,6 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
rdev = cq->rdev;
dev_attr = rdev->dev_attr;
- if (cq->resize_umem) {
- ibdev_err(&rdev->ibdev, "Resize CQ %#x failed - Busy",
- cq->qplib_cq.id);
- return -EBUSY;
- }
-
/* Check the requested cq depth out of supported depth */
if (cqe > dev_attr->max_cq_wqes)
return -EINVAL;
--
2.52.0
* [PATCH rdma-next 42/50] RDMA/bnxt_re: Complete CQ resize in a single step
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
There is no need to defer the CQ resize operation, as it is intended to
be completed in one pass. The current bnxt_re_resize_cq() implementation
does not handle concurrent CQ resize requests, and this will be addressed
in the following patches.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 33 +++++++++-----------------------
1 file changed, 9 insertions(+), 24 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index d652018c19b3..2aecfbbb7eaf 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3309,20 +3309,6 @@ int bnxt_re_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
return rc;
}
-static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
-{
- struct bnxt_re_dev *rdev = cq->rdev;
-
- bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
-
- cq->qplib_cq.max_wqe = cq->resize_cqe;
- if (cq->resize_umem) {
- ib_umem_release(cq->ib_cq.umem);
- cq->ib_cq.umem = cq->resize_umem;
- cq->resize_umem = NULL;
- cq->resize_cqe = 0;
- }
-}
int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
struct ib_udata *udata)
@@ -3387,7 +3373,15 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
goto fail;
}
- cq->ib_cq.cqe = cq->resize_cqe;
+ bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
+
+ cq->qplib_cq.max_wqe = cq->resize_cqe;
+ ib_umem_release(cq->ib_cq.umem);
+ cq->ib_cq.umem = cq->resize_umem;
+ cq->resize_umem = NULL;
+ cq->resize_cqe = 0;
+
+ cq->ib_cq.cqe = entries;
atomic_inc(&rdev->stats.res.resize_count);
return 0;
@@ -3907,15 +3901,6 @@ int bnxt_re_poll_cq(struct ib_cq *ib_cq, int num_entries, struct ib_wc *wc)
struct bnxt_re_sqp_entries *sqp_entry = NULL;
unsigned long flags;
- /* User CQ; the only processing we do is to
- * complete any pending CQ resize operation.
- */
- if (cq->ib_cq.umem) {
- if (cq->resize_umem)
- bnxt_re_resize_cq_complete(cq);
- return 0;
- }
-
spin_lock_irqsave(&cq->cq_lock, flags);
budget = min_t(u32, num_entries, cq->max_cql);
num_entries = budget;
--
2.52.0
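
The single-step completion described in this patch can be pictured with a small userspace sketch: allocate the new buffer, and on success release the old one and publish the new size before returning, instead of leaving a pending buffer for the poll path to pick up later. Everything here (`struct demo_cq`, `demo_resize_cq`, `DEMO_CQE_SIZE`) is hypothetical, uses malloc-style memory rather than umem, and omits the hardware command; it is not the bnxt_re code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-entry size, used only to size the demo buffer. */
#define DEMO_CQE_SIZE 64u

struct demo_cq {
	void *buf;		/* buffer currently in use */
	void *resize_buf;	/* staging buffer, non-NULL only mid-resize */
	unsigned int cqe;	/* published queue depth */
};

static int demo_resize_cq(struct demo_cq *cq, unsigned int entries)
{
	assert(cq != NULL);

	if (entries == 0)
		return -EINVAL;

	cq->resize_buf = calloc(entries, DEMO_CQE_SIZE);
	if (!cq->resize_buf)
		return -ENOMEM;

	/*
	 * A real driver would issue the hardware resize command here
	 * and free resize_buf if that command failed.
	 */

	/* Complete the resize in the same call: swap and publish. */
	free(cq->buf);
	cq->buf = cq->resize_buf;
	cq->resize_buf = NULL;
	cq->cqe = entries;
	return 0;
}
```

Because the staging pointer is always NULL again when the function returns, no other path needs to check for a half-finished resize.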
* [PATCH rdma-next 41/50] RDMA/core: Generalize CQ resize locking
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
The CQ resize path must be protected from concurrent execution because it
updates in-kernel objects. Some drivers did not provide any locking,
leading to inconsistent behavior.
Rely on the core mutex for synchronization and drop the various ad-hoc
locking implementations in individual drivers.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/uverbs_cmd.c | 1 +
drivers/infiniband/core/uverbs_std_types_cq.c | 1 +
drivers/infiniband/core/verbs.c | 2 ++
include/rdma/ib_verbs.h | 3 +++
4 files changed, 7 insertions(+)
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index b4b0c7c92fb1..1348ebd7a1c3 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1067,6 +1067,7 @@ static int create_cq(struct uverbs_attr_bundle *attrs,
cq->event_handler = ib_uverbs_cq_event_handler;
cq->cq_context = ev_file ? &ev_file->ev_queue : NULL;
atomic_set(&cq->usecnt, 0);
+ mutex_init(&cq->resize_mutex);
rdma_restrack_new(&cq->res, RDMA_RESTRACK_CQ);
rdma_restrack_set_name(&cq->res, NULL);
diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
index a12e3184dd5c..c572f528579d 100644
--- a/drivers/infiniband/core/uverbs_std_types_cq.c
+++ b/drivers/infiniband/core/uverbs_std_types_cq.c
@@ -195,6 +195,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
*/
cq->umem = umem;
atomic_set(&cq->usecnt, 0);
+ mutex_init(&cq->resize_mutex);
rdma_restrack_new(&cq->res, RDMA_RESTRACK_CQ);
rdma_restrack_set_name(&cq->res, NULL);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 5f59487fc9d4..b308100ba964 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2257,6 +2257,8 @@ int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
if (ret)
return ret;
+ if (udata)
+ mutex_destroy(&cq->resize_mutex);
ib_umem_release(cq->umem);
rdma_restrack_del(&cq->res);
kfree(cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7d32d02c35e3..48340b39ab26 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1638,8 +1638,11 @@ struct ib_cq {
struct ib_wc *wc;
struct list_head pool_entry;
union {
+ /* Kernel CQs */
struct irq_poll iop;
struct work_struct work;
+ /* Uverbs CQs */
+ struct mutex resize_mutex;
};
struct workqueue_struct *comp_wq;
struct dim *dim;
--
2.52.0
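
The diff above only shows where the new mutex is initialized and destroyed. As a rough illustration of the intended effect, the userspace sketch below wraps a driver resize callback in a per-CQ lock so that two resize requests on the same CQ cannot interleave. It uses pthreads in place of the kernel mutex API, and `struct demo_cq`, `demo_cq_init`, `demo_driver_resize` and `demo_core_resize` are hypothetical names; where the core actually takes `resize_mutex` is an assumption, since that call site is not part of this hunk.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical device limit for the demo driver callback. */
#define DEMO_MAX_CQE 1024u

struct demo_cq {
	pthread_mutex_t resize_mutex;	/* serializes resize requests */
	unsigned int cqe;		/* current queue depth */
};

static void demo_cq_init(struct demo_cq *cq, unsigned int cqe)
{
	assert(cq != NULL);
	pthread_mutex_init(&cq->resize_mutex, NULL);
	cq->cqe = cqe;
}

/* Stand-in for a driver callback; it does no locking of its own. */
static int demo_driver_resize(struct demo_cq *cq, unsigned int cqe)
{
	if (cqe > DEMO_MAX_CQE)
		return -EINVAL;
	cq->cqe = cqe;
	return 0;
}

/* Assumed core-side wrapper: take the lock around the driver call. */
static int demo_core_resize(struct demo_cq *cq, unsigned int cqe)
{
	int ret;

	pthread_mutex_lock(&cq->resize_mutex);
	ret = demo_driver_resize(cq, cqe);
	pthread_mutex_unlock(&cq->resize_mutex);
	return ret;
}

static void demo_cq_destroy(struct demo_cq *cq)
{
	pthread_mutex_destroy(&cq->resize_mutex);
}
```

With the lock held in one common place, individual drivers can drop checks such as the -EBUSY test on a pending resize buffer that patch 43/50 removes from bnxt_re.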
* [PATCH rdma-next 36/50] RDMA/mlx5: Remove support for resizing kernel CQs
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
No ULP users rely on CQ resize support, so drop the unused code.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/cq.c | 161 +++++-----------------------------------
1 file changed, 18 insertions(+), 143 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 52a435efd0de..ce20af01cde0 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -74,11 +74,6 @@ static void *get_cqe(struct mlx5_ib_cq *cq, int n)
return mlx5_frag_buf_get_wqe(&cq->buf.fbc, n);
}
-static u8 sw_ownership_bit(int n, int nent)
-{
- return (n & nent) ? 1 : 0;
-}
-
static void *get_sw_cqe(struct mlx5_ib_cq *cq, int n)
{
void *cqe = get_cqe(cq, n & cq->ibcq.cqe);
@@ -1258,87 +1253,11 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
return 0;
}
-static int resize_kernel(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
- int entries, int cqe_size)
-{
- int err;
-
- cq->resize_buf = kzalloc(sizeof(*cq->resize_buf), GFP_KERNEL);
- if (!cq->resize_buf)
- return -ENOMEM;
-
- err = alloc_cq_frag_buf(dev, cq->resize_buf, entries, cqe_size);
- if (err)
- goto ex;
-
- init_cq_frag_buf(cq->resize_buf);
-
- return 0;
-
-ex:
- kfree(cq->resize_buf);
- return err;
-}
-
-static int copy_resize_cqes(struct mlx5_ib_cq *cq)
-{
- struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
- struct mlx5_cqe64 *scqe64;
- struct mlx5_cqe64 *dcqe64;
- void *start_cqe;
- void *scqe;
- void *dcqe;
- int ssize;
- int dsize;
- int i;
- u8 sw_own;
-
- ssize = cq->buf.cqe_size;
- dsize = cq->resize_buf->cqe_size;
- if (ssize != dsize) {
- mlx5_ib_warn(dev, "resize from different cqe size is not supported\n");
- return -EINVAL;
- }
-
- i = cq->mcq.cons_index;
- scqe = get_sw_cqe(cq, i);
- scqe64 = ssize == 64 ? scqe : scqe + 64;
- start_cqe = scqe;
- if (!scqe) {
- mlx5_ib_warn(dev, "expected cqe in sw ownership\n");
- return -EINVAL;
- }
-
- while (get_cqe_opcode(scqe64) != MLX5_CQE_RESIZE_CQ) {
- dcqe = mlx5_frag_buf_get_wqe(&cq->resize_buf->fbc,
- (i + 1) & cq->resize_buf->nent);
- dcqe64 = dsize == 64 ? dcqe : dcqe + 64;
- sw_own = sw_ownership_bit(i + 1, cq->resize_buf->nent);
- memcpy(dcqe, scqe, dsize);
- dcqe64->op_own = (dcqe64->op_own & ~MLX5_CQE_OWNER_MASK) | sw_own;
-
- ++i;
- scqe = get_sw_cqe(cq, i);
- scqe64 = ssize == 64 ? scqe : scqe + 64;
- if (!scqe) {
- mlx5_ib_warn(dev, "expected cqe in sw ownership\n");
- return -EINVAL;
- }
-
- if (scqe == start_cqe) {
- pr_warn("resize CQ failed to get resize CQE, CQN 0x%x\n",
- cq->mcq.cqn);
- return -ENOMEM;
- }
- }
- ++cq->mcq.cons_index;
- return 0;
-}
-
int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(ibcq->device);
struct mlx5_ib_cq *cq = to_mcq(ibcq);
+ unsigned long page_size;
void *cqc;
u32 *in;
int err;
@@ -1348,7 +1267,6 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
unsigned int page_shift;
int inlen;
int cqe_size;
- unsigned long flags;
if (!MLX5_CAP_GEN(dev->mdev, cq_resize)) {
pr_info("Firmware does not support resize CQ\n");
@@ -1371,34 +1289,19 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
return 0;
mutex_lock(&cq->resize_mutex);
- if (udata) {
- unsigned long page_size;
-
- err = resize_user(dev, cq, entries, udata, &cqe_size);
- if (err)
- goto ex;
-
- page_size = mlx5_umem_find_best_cq_quantized_pgoff(
- cq->resize_umem, cqc, log_page_size,
- MLX5_ADAPTER_PAGE_SHIFT, page_offset, 64,
- &page_offset_quantized);
- if (!page_size) {
- err = -EINVAL;
- goto ex_resize;
- }
- npas = ib_umem_num_dma_blocks(cq->resize_umem, page_size);
- page_shift = order_base_2(page_size);
- } else {
- struct mlx5_frag_buf *frag_buf;
+ err = resize_user(dev, cq, entries, udata, &cqe_size);
+ if (err)
+ goto ex;
- cqe_size = 64;
- err = resize_kernel(dev, cq, entries, cqe_size);
- if (err)
- goto ex;
- frag_buf = &cq->resize_buf->frag_buf;
- npas = frag_buf->npages;
- page_shift = frag_buf->page_shift;
+ page_size = mlx5_umem_find_best_cq_quantized_pgoff(
+ cq->resize_umem, cqc, log_page_size, MLX5_ADAPTER_PAGE_SHIFT,
+ page_offset, 64, &page_offset_quantized);
+ if (!page_size) {
+ err = -EINVAL;
+ goto ex_resize;
}
+ npas = ib_umem_num_dma_blocks(cq->resize_umem, page_size);
+ page_shift = order_base_2(page_size);
inlen = MLX5_ST_SZ_BYTES(modify_cq_in) +
MLX5_FLD_SZ_BYTES(modify_cq_in, pas[0]) * npas;
@@ -1410,11 +1313,7 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
}
pas = (__be64 *)MLX5_ADDR_OF(modify_cq_in, in, pas);
- if (udata)
- mlx5_ib_populate_pas(cq->resize_umem, 1UL << page_shift, pas,
- 0);
- else
- mlx5_fill_page_frag_array(&cq->resize_buf->frag_buf, pas);
+ mlx5_ib_populate_pas(cq->resize_umem, 1UL << page_shift, pas, 0);
MLX5_SET(modify_cq_in, in,
modify_field_select_resize_field_select.resize_field_select.resize_field_select,
@@ -1440,31 +1339,10 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
if (err)
goto ex_alloc;
- if (udata) {
- cq->ibcq.cqe = entries - 1;
- ib_umem_release(cq->ibcq.umem);
- cq->ibcq.umem = cq->resize_umem;
- cq->resize_umem = NULL;
- } else {
- struct mlx5_ib_cq_buf tbuf;
- int resized = 0;
-
- spin_lock_irqsave(&cq->lock, flags);
- if (cq->resize_buf) {
- err = copy_resize_cqes(cq);
- if (!err) {
- tbuf = cq->buf;
- cq->buf = *cq->resize_buf;
- kfree(cq->resize_buf);
- cq->resize_buf = NULL;
- resized = 1;
- }
- }
- cq->ibcq.cqe = entries - 1;
- spin_unlock_irqrestore(&cq->lock, flags);
- if (resized)
- free_cq_buf(dev, &tbuf);
- }
+ cq->ibcq.cqe = entries - 1;
+ ib_umem_release(cq->ibcq.umem);
+ cq->ibcq.umem = cq->resize_umem;
+ cq->resize_umem = NULL;
mutex_unlock(&cq->resize_mutex);
kvfree(in);
@@ -1475,10 +1353,7 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
ex_resize:
ib_umem_release(cq->resize_umem);
- if (!udata) {
- free_cq_buf(dev, cq->resize_buf);
- cq->resize_buf = NULL;
- }
+ cq->resize_umem = NULL;
ex:
mutex_unlock(&cq->resize_mutex);
return err;
--
2.52.0
* [PATCH rdma-next 39/50] RDMA/rxe: Remove unused kernel-side CQ resize support
From: Leon Romanovsky @ 2026-02-13 10:58 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Selvin Xavier, Kalesh AP,
Potnuri Bharat Teja, Michael Margolin, Gal Pressman,
Yossi Leybovich, Cheng Xu, Kai Shen, Chengchang Tang,
Junxian Huang, Abhijit Gangurde, Allen Hubbe, Krzysztof Czurylo,
Tatyana Nikolova, Long Li, Konstantin Taranov, Yishai Hadas,
Michal Kalderon, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Christian Benvenuti,
Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun
Cc: linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260213-refactor-umem-v1-0-f3be85847922@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
CQ resizing is only used by uverbs; the kernel-side CQ resize path has
no users and can be removed.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/sw/rxe/rxe_verbs.c | 27 +++++++--------------------
1 file changed, 7 insertions(+), 20 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 72e3019ed1cb..bc7c77ff3d90 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1146,32 +1146,19 @@ static int rxe_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
struct rxe_resize_cq_resp __user *uresp = NULL;
int err;
- if (udata) {
- if (udata->outlen < sizeof(*uresp)) {
- err = -EINVAL;
- rxe_dbg_cq(cq, "malformed udata\n");
- goto err_out;
- }
- uresp = udata->outbuf;
- }
+ if (udata->outlen < sizeof(*uresp))
+ return -EINVAL;
+ uresp = udata->outbuf;
err = rxe_cq_chk_attr(rxe, cq, cqe, 0);
- if (err) {
- rxe_dbg_cq(cq, "bad attr, err = %d\n", err);
- goto err_out;
- }
+ if (err)
+ return err;
err = rxe_cq_resize_queue(cq, cqe, uresp, udata);
- if (err) {
- rxe_dbg_cq(cq, "resize cq failed, err = %d\n", err);
- goto err_out;
- }
+ if (err)
+ return err;
return 0;
-
-err_out:
- rxe_err_cq(cq, "returned err = %d\n", err);
- return err;
}
static int rxe_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
--
2.52.0