Date: Tue, 24 Feb 2026 14:56:24 +0800
From: "D. Wythe"
To: Mahanta Jambigi
Cc: "D. Wythe", "David S. Miller", Dust Li, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
	Simon Horman, Tony Lu, Wen Gu, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, linux-s390@vger.kernel.org,
	netdev@vger.kernel.org, oliver.yang@linux.alibaba.com,
	pasic@linux.ibm.com
Subject: Re: [PATCH RFC net-next] net/smc: transition to RDMA core CQ pooling
Message-ID: <20260224065624.GA67725@j66a10360.sqa.eu95>
References: <20260202094800.30373-1-alibuda@linux.alibaba.com>

On Mon, Feb 23, 2026 at 06:01:38PM +0530, Mahanta Jambigi wrote:
> 
> 
> On 02/02/26 3:18 pm, D. Wythe wrote:
> > The current SMC-R implementation relies on global per-device CQs
> > and manual polling within tasklets, which introduces severe
> > scalability bottlenecks due to global lock contention and tasklet
> > scheduling overhead, resulting in poor performance as concurrency
> > increases.
> > 
> > Refactor the completion handling to utilize the ib_cqe API and
> > standard RDMA core CQ pooling. This transition provides several key
> > advantages:
> > 
> > 1. Multi-CQ: Shift from a single shared per-device CQ to multiple
> > link-specific CQs via the CQ pool. This allows completion processing
> > to be parallelized across multiple CPU cores, effectively eliminating
> > the global CQ bottleneck.
> > 
> > 2. Leverage DIM: Utilizing the standard CQ pool with IB_POLL_SOFTIRQ
> > enables Dynamic Interrupt Moderation from the RDMA core, optimizing
> > interrupt frequency and reducing CPU load under high pressure.
> > 
> > 3. O(1) Context Retrieval: Replaces the expensive wr_id based lookup
> > logic (e.g., smc_wr_tx_find_pending_index) with direct context retrieval
> > using container_of() on the embedded ib_cqe.
> > 
> > 4. Code Simplification: This refactoring results in a reduction of
> > ~150 lines of code. It removes redundant sequence tracking, complex lookup
> > helpers, and manual CQ management, significantly improving maintainability.
> > 
> > Performance Test: redis-benchmark with max 32 connections per QP
> > Data format: Requests Per Second (RPS), Percentage in brackets
> > represents the gain/loss compared to TCP.
> > 
> > | Clients | TCP    | SMC (original)    | SMC (cq_pool) |
> > |---------|--------|-------------------|---------------|
> > | c = 1   | 24449  | 31172  (+27%)     | 34039  (+39%) |
> > | c = 2   | 46420  | 53216  (+14%)     | 64391  (+38%) |
> > | c = 16  | 159673 | 83668  (-48%) <-- | 216947 (+36%) |
> > | c = 32  | 164956 | 97631  (-41%) <-- | 249376 (+51%) |
> > | c = 64  | 166322 | 118192 (-29%) <-- | 249488 (+50%) |
> > | c = 128 | 167700 | 121497 (-27%) <-- | 249480 (+48%) |
> > | c = 256 | 175021 | 146109 (-16%) <-- | 240384 (+37%) |
> > | c = 512 | 168987 | 101479 (-40%) <-- | 226634 (+34%) |
> > 
> > The results demonstrate that this optimization effectively resolves the
> > scalability bottleneck, with RPS increasing by over 110% at c=64
> > compared to the original implementation.
> > 
> > Signed-off-by: D. Wythe
> > ---
> >  net/smc/smc_core.c |   8 +-
> >  net/smc/smc_core.h |  16 ++-
> >  net/smc/smc_ib.c   | 114 ++++++-------------
> >  net/smc/smc_ib.h   |   5 -
> >  net/smc/smc_tx.c   |   1 -
> >  net/smc/smc_wr.c   | 267 ++++++++++++++++-----------------------------
> >  net/smc/smc_wr.h   |  38 ++-----
> >  7 files changed, 150 insertions(+), 299 deletions(-)
> > 
> > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> > index 8aca5dc54be7..9590c8aed3dd 100644
> > --- a/net/smc/smc_core.c
> > +++ b/net/smc/smc_core.c
> > @@ -815,17 +815,11 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >  	lnk->lgr = lgr;
> >  	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
> >  	lnk->link_idx = link_idx;
> > -	lnk->wr_rx_id_compl = 0;
> >  	smc_ibdev_cnt_inc(lnk);
> >  	smcr_copy_dev_info_to_link(lnk);
> >  	atomic_set(&lnk->conn_cnt, 0);
> >  	smc_llc_link_set_uid(lnk);
> >  	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
> > -	if (!lnk->smcibdev->initialized) {
> > -		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
> > -		if (rc)
> > -			goto out;
> > -	}
> >  	get_random_bytes(rndvec, sizeof(rndvec));
> >  	lnk->psn_initial = rndvec[0] + (rndvec[1] << 8) +
> >  			   (rndvec[2] << 16);
> > @@ -1373,7 +1367,7 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
> >  	smc_llc_link_clear(lnk, log);
> >  	smcr_buf_unmap_lgr(lnk);
> >  	smcr_rtoken_clear_link(lnk);
> > -	smc_ib_modify_qp_error(lnk);
> > +	smc_wr_drain_qp(lnk);
> >  	smc_wr_free_link(lnk);
> >  	smc_ib_destroy_queue_pair(lnk);
> >  	smc_ib_dealloc_protection_domain(lnk);
> > diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> > index 5c18f08a4c8a..00468b7a279f 100644
> > --- a/net/smc/smc_core.h
> > +++ b/net/smc/smc_core.h
> > @@ -92,6 +92,12 @@ struct smc_rdma_wr {	/* work requests per message
> >  	struct ib_rdma_wr	wr_tx_rdma[SMC_MAX_RDMA_WRITES];
> >  };
> > 
> > +struct smc_ib_recv_wr {
> > +	struct ib_cqe		cqe;
> > +	struct ib_recv_wr	wr;
> > +	int			index;
> > +};
> > +
> 
> This new struct is used for recv only. May I know if we need to have the
> same for send (struct smc_ib_send_wr)? e.g. a struct smc_ib_send_wr
> *wr_tx_ibs member in struct smc_link?
> 

I placed the send ib_cqe in smc_wr_tx_pend in this version. However, I
agree that introducing a symmetric smc_ib_send_wr is much more
reasonable. I'll update this in the next version.

> >  #define SMC_LGR_ID_SIZE		4
> > 
> >  struct smc_link {
> > @@ -100,6 +106,8 @@ struct smc_link {
> >  	struct ib_pd		*roce_pd;	/* IB protection domain,
> >  						 * unique for every RoCE QP
> >  						 */
> > +	int			nr_cqe;
> 
> This should be an unsigned int, as seen in ib_cq_pool_put() (2nd arg).
> 
> > +	struct ib_cq		*ib_cq;
> 
> My suggestion would be to have a comment next to these 2 new members, as
> we have the same for the other members in this structure.
> 

Take it.
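To make that concrete: the symmetric send-side struct I have in mind
would simply mirror smc_ib_recv_wr, including the member comments you
asked for. A rough sketch only, not the final layout:

	struct smc_ib_send_wr {
		struct ib_cqe		cqe;	/* resolved via container_of()
						 * in the send completion handler
						 */
		struct ib_send_wr	wr;	/* WR send meta data */
		int			index;	/* slot in the wr_tx_* arrays */
	};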
> >  	struct ib_qp		*roce_qp;	/* IB queue pair */
> >  	struct ib_qp_attr	qp_attr;	/* IB queue pair attributes */
> > 
> > @@ -107,6 +115,7 @@ struct smc_link {
> >  	struct ib_send_wr	*wr_tx_ibs;	/* WR send meta data */
> >  	struct ib_sge		*wr_tx_sges;	/* WR send gather meta data */
> >  	struct smc_rdma_sges	*wr_tx_rdma_sges;/*RDMA WRITE gather meta data*/
> > +	struct ib_cqe		tx_rdma_cqe;	/* CQE RDMA WRITE */
> >  	struct smc_rdma_wr	*wr_tx_rdmas;	/* WR RDMA WRITE */
> >  	struct smc_wr_tx_pend	*wr_tx_pends;	/* WR send waiting for CQE */
> >  	struct completion	*wr_tx_compl;	/* WR send CQE completion */
> > @@ -116,7 +125,6 @@ struct smc_link {
> >  	struct smc_wr_tx_pend	*wr_tx_v2_pend;	/* WR send v2 waiting for CQE */
> >  	dma_addr_t		wr_tx_dma_addr;	/* DMA address of wr_tx_bufs */
> >  	dma_addr_t		wr_tx_v2_dma_addr; /* DMA address of v2 tx buf*/
> > -	atomic_long_t		wr_tx_id;	/* seq # of last sent WR */
> >  	unsigned long		*wr_tx_mask;	/* bit mask of used indexes */
> >  	u32			wr_tx_cnt;	/* number of WR send buffers */
> >  	wait_queue_head_t	wr_tx_wait;	/* wait for free WR send buf */
> > @@ -126,7 +134,7 @@ struct smc_link {
> >  	struct completion	tx_ref_comp;
> > 
> >  	u8			*wr_rx_bufs;	/* WR recv payload buffers */
> > -	struct ib_recv_wr	*wr_rx_ibs;	/* WR recv meta data */
> > +	struct smc_ib_recv_wr	*wr_rx_ibs;	/* WR recv meta data */
> >  	struct ib_sge		*wr_rx_sges;	/* WR recv scatter meta data */
> >  	/* above three vectors have wr_rx_cnt elements and use the same index */
> >  	int			wr_rx_sge_cnt;	/* rx sge, V1 is 1, V2 is either 2 or 1 */
> > @@ -135,13 +143,11 @@ struct smc_link {
> >  						 */
> >  	dma_addr_t		wr_rx_dma_addr;	/* DMA address of wr_rx_bufs */
> >  	dma_addr_t		wr_rx_v2_dma_addr; /* DMA address of v2 rx buf*/
> > -	u64			wr_rx_id;	/* seq # of last recv WR */
> > -	u64			wr_rx_id_compl;	/* seq # of last completed WR */
> >  	u32			wr_rx_cnt;	/* number of WR recv buffers */
> >  	unsigned long		wr_rx_tstamp;	/* jiffies when last buf rx */
> > -	wait_queue_head_t	wr_rx_empty_wait; /* wait for RQ empty */
> > 
> >  	struct ib_reg_wr	wr_reg;		/* WR register memory region */
> > +	struct ib_cqe		wr_reg_cqe;
> >  	wait_queue_head_t	wr_reg_wait;	/* wait for wr_reg result */
> >  	struct {
> >  		struct percpu_ref	wr_reg_refs;
> > diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
> > index 67211d44a1db..77047ad7d452 100644
> > --- a/net/smc/smc_ib.c
> > +++ b/net/smc/smc_ib.c
> > @@ -112,15 +112,6 @@ int smc_ib_modify_qp_rts(struct smc_link *lnk)
> >  			    IB_QP_MAX_QP_RD_ATOMIC);
> >  }
> > 
> > -int smc_ib_modify_qp_error(struct smc_link *lnk)
> > -{
> > -	struct ib_qp_attr qp_attr;
> > -
> > -	memset(&qp_attr, 0, sizeof(qp_attr));
> > -	qp_attr.qp_state = IB_QPS_ERR;
> > -	return ib_modify_qp(lnk->roce_qp, &qp_attr, IB_QP_STATE);
> > -}
> > -
> >  int smc_ib_ready_link(struct smc_link *lnk)
> >  {
> >  	struct smc_link_group *lgr = smc_get_lgr(lnk);
> > @@ -134,10 +125,7 @@ int smc_ib_ready_link(struct smc_link *lnk)
> >  	if (rc)
> >  		goto out;
> >  	smc_wr_remember_qp_attr(lnk);
> > -	rc = ib_req_notify_cq(lnk->smcibdev->roce_cq_recv,
> > -			      IB_CQ_SOLICITED_MASK);
> > -	if (rc)
> > -		goto out;
> > +
> >  	rc = smc_wr_rx_post_init(lnk);
> >  	if (rc)
> >  		goto out;
> > @@ -658,38 +646,60 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
> >  	if (lnk->roce_qp)
> >  		ib_destroy_qp(lnk->roce_qp);
> >  	lnk->roce_qp = NULL;
> > +	if (lnk->ib_cq) {
> > +		ib_cq_pool_put(lnk->ib_cq, lnk->nr_cqe);
> > +		lnk->ib_cq = NULL;
> > +	}
> >  }
> > 
> >  /* create a queue pair within the protection domain for a link */
> >  int smc_ib_create_queue_pair(struct smc_link *lnk)
> >  {
> > +	int max_send_wr, max_recv_wr, rc;
> > +	struct ib_cq *cq;
> > +
> > +	/* include unsolicited rdma_writes as well,
> > +	 * there are max. 2 RDMA_WRITE per 1 WR_SEND.
> > +	 * +1 for ib_drain_qp()
> > +	 */
> > +	max_send_wr = 3 * lnk->lgr->max_send_wr + 1;
> > +	max_recv_wr = lnk->lgr->max_recv_wr + 1;
> > +
> > +	cq = ib_cq_pool_get(lnk->smcibdev->ibdev, max_send_wr + max_recv_wr, -1,
> > +			    IB_POLL_SOFTIRQ);
> 
> Should we make the 3rd arg (the completion vector hint) a sysctl-tunable
> variable, in case a customer wants to use it for CPU locality
> (high-performance networking)? This way they have the option to tweak
> it. The same logic applies to the CQ polling context, by using
> *IB_POLL_DIRECT*. Just a thought; I don't have a strong opinion on this.
> I am fine with the current values also (-1 & IB_POLL_SOFTIRQ).
> 

Good point. But supporting different polling contexts (like
IB_POLL_DIRECT) would require different handling logic, so we can
consider this as a future enhancement. (A sketch of the completion
vector hint follows below, after the smc_ib.h hunk.)

> > +
> > +	if (IS_ERR(cq)) {
> > +		rc = PTR_ERR(cq);
> > +		return rc;
> > +	}
> > +
> >  	struct ib_qp_init_attr qp_attr = {
> >  		.event_handler = smc_ib_qp_event_handler,
> >  		.qp_context = lnk,
> > -		.send_cq = lnk->smcibdev->roce_cq_send,
> > -		.recv_cq = lnk->smcibdev->roce_cq_recv,
> > +		.send_cq = cq,
> > +		.recv_cq = cq,
> >  		.srq = NULL,
> >  		.cap = {
> >  			.max_send_sge = SMC_IB_MAX_SEND_SGE,
> >  			.max_recv_sge = lnk->wr_rx_sge_cnt,
> > +			.max_send_wr = max_send_wr,
> > +			.max_recv_wr = max_recv_wr,
> >  			.max_inline_data = 0,
> >  		},
> >  		.sq_sig_type = IB_SIGNAL_REQ_WR,
> >  		.qp_type = IB_QPT_RC,
> >  	};
> > -	int rc;
> > 
> > -	/* include unsolicited rdma_writes as well,
> > -	 * there are max. 2 RDMA_WRITE per 1 WR_SEND
> > -	 */
> > -	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> > -	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
> >  	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
> >  	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
> > -	if (IS_ERR(lnk->roce_qp))
> > +	if (IS_ERR(lnk->roce_qp)) {
> >  		lnk->roce_qp = NULL;
> > -	else
> > +		ib_cq_pool_put(cq, max_send_wr + max_recv_wr);
> > +	} else {
> >  		smc_wr_remember_qp_attr(lnk);
> > +		lnk->nr_cqe = max_send_wr + max_recv_wr;
> > +		lnk->ib_cq = cq;
> > +	}
> >  	return rc;
> >  }
> > 
> > @@ -855,62 +865,6 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
> >  	buf_slot->sgt[lnk->link_idx].sgl->dma_address = 0;
> >  }
> > 
> > -long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
> > -{
> > -	struct ib_cq_init_attr cqattr = {
> > -		.cqe = SMC_MAX_CQE, .comp_vector = 0 };
> > -	int cqe_size_order, smc_order;
> > -	long rc;
> > -
> > -	mutex_lock(&smcibdev->mutex);
> > -	rc = 0;
> > -	if (smcibdev->initialized)
> > -		goto out;
> > -	/* the calculated number of cq entries fits to mlx5 cq allocation */
> > -	cqe_size_order = cache_line_size() == 128 ? 7 : 6;
> > -	smc_order = MAX_PAGE_ORDER - cqe_size_order;
> > -	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
> > -		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
> > -	smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
> > -					      smc_wr_tx_cq_handler, NULL,
> > -					      smcibdev, &cqattr);
> > -	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_send);
> > -	if (IS_ERR(smcibdev->roce_cq_send)) {
> > -		smcibdev->roce_cq_send = NULL;
> > -		goto out;
> > -	}
> > -	smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
> > -					      smc_wr_rx_cq_handler, NULL,
> > -					      smcibdev, &cqattr);
> > -	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_recv);
> > -	if (IS_ERR(smcibdev->roce_cq_recv)) {
> > -		smcibdev->roce_cq_recv = NULL;
> > -		goto err;
> > -	}
> > -	smc_wr_add_dev(smcibdev);
> > -	smcibdev->initialized = 1;
> > -	goto out;
> > -
> > -err:
> > -	ib_destroy_cq(smcibdev->roce_cq_send);
> > -out:
> > -	mutex_unlock(&smcibdev->mutex);
> > -	return rc;
> > -}
> > -
> > -static void smc_ib_cleanup_per_ibdev(struct smc_ib_device *smcibdev)
> > -{
> > -	mutex_lock(&smcibdev->mutex);
> > -	if (!smcibdev->initialized)
> > -		goto out;
> > -	smcibdev->initialized = 0;
> > -	ib_destroy_cq(smcibdev->roce_cq_recv);
> > -	ib_destroy_cq(smcibdev->roce_cq_send);
> > -	smc_wr_remove_dev(smcibdev);
> > -out:
> > -	mutex_unlock(&smcibdev->mutex);
> > -}
> > -
> >  static struct ib_client smc_ib_client;
> > 
> >  static void smc_copy_netdev_ifindex(struct smc_ib_device *smcibdev, int port)
> > @@ -969,7 +923,6 @@ static int smc_ib_add_dev(struct ib_device *ibdev)
> >  	INIT_WORK(&smcibdev->port_event_work, smc_ib_port_event_work);
> >  	atomic_set(&smcibdev->lnk_cnt, 0);
> >  	init_waitqueue_head(&smcibdev->lnks_deleted);
> > -	mutex_init(&smcibdev->mutex);
> >  	mutex_lock(&smc_ib_devices.mutex);
> >  	list_add_tail(&smcibdev->list, &smc_ib_devices.list);
> >  	mutex_unlock(&smc_ib_devices.mutex);
> > @@ -1018,7 +971,6 @@ static void smc_ib_remove_dev(struct ib_device *ibdev, void *client_data)
> >  	pr_warn_ratelimited("smc: removing ib device %s\n",
> >  			    smcibdev->ibdev->name);
> >  	smc_smcr_terminate_all(smcibdev);
> > -	smc_ib_cleanup_per_ibdev(smcibdev);
> >  	ib_unregister_event_handler(&smcibdev->event_handler);
> >  	cancel_work_sync(&smcibdev->port_event_work);
> >  	kfree(smcibdev);
> > diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
> > index ef8ac2b7546d..c5a0d773b73f 100644
> > --- a/net/smc/smc_ib.h
> > +++ b/net/smc/smc_ib.h
> > @@ -37,10 +37,6 @@ struct smc_ib_device {	/* ib-device infos for smc */
> >  	struct ib_device	*ibdev;
> >  	struct ib_port_attr	pattr[SMC_MAX_PORTS];	/* ib dev. port attrs */
> >  	struct ib_event_handler	event_handler;	/* global ib_event handler */
> > -	struct ib_cq		*roce_cq_send;	/* send completion queue */
> > -	struct ib_cq		*roce_cq_recv;	/* recv completion queue */
> > -	struct tasklet_struct	send_tasklet;	/* called by send cq handler */
> > -	struct tasklet_struct	recv_tasklet;	/* called by recv cq handler */
> >  	char			mac[SMC_MAX_PORTS][ETH_ALEN];
> >  						/* mac address per port*/
> >  	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
> > @@ -96,7 +92,6 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk);
> >  int smc_ib_create_queue_pair(struct smc_link *lnk);
> >  int smc_ib_ready_link(struct smc_link *lnk);
> >  int smc_ib_modify_qp_rts(struct smc_link *lnk);
> > -int smc_ib_modify_qp_error(struct smc_link *lnk);
> >  long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev);
> 
> This declaration is not required anymore, as both the definition and the
> invocation of this function are removed by your patch.
> 

Take it.
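Coming back to the completion vector hint discussed above: if we ever
expose it, the wiring itself would be small. A minimal sketch (the
sysctl variable and its name are hypothetical, not part of this patch):

	/* hypothetical knob; -1 keeps the RDMA core's own vector choice */
	static int sysctl_smc_comp_vector = -1;

	cq = ib_cq_pool_get(lnk->smcibdev->ibdev, max_send_wr + max_recv_wr,
			    READ_ONCE(sysctl_smc_comp_vector),
			    IB_POLL_SOFTIRQ);

Supporting IB_POLL_DIRECT is the harder part, since a direct-polled CQ
needs an explicit caller of ib_process_cq_direct(); that is why I would
keep it as a future enhancement.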
> > 
> >  int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
> >  			     struct smc_buf_desc *buf_slot, u8 link_idx);
> > diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> > index 3144b4b1fe29..d301df9ed58b 100644
> > --- a/net/smc/smc_tx.c
> > +++ b/net/smc/smc_tx.c
> > @@ -321,7 +321,6 @@ static int smc_tx_rdma_write(struct smc_connection *conn, int peer_rmbe_offset,
> >  	struct smc_link *link = conn->lnk;
> >  	int rc;
> > 
> > -	rdma_wr->wr.wr_id = smc_wr_tx_get_next_wr_id(link);
> >  	rdma_wr->wr.num_sge = num_sges;
> >  	rdma_wr->remote_addr =
> >  		lgr->rtokens[conn->rtoken_idx][link->link_idx].dma_addr +
> > diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> > index 5feafa98ab1a..3a361aa020ab 100644
> > --- a/net/smc/smc_wr.c
> > +++ b/net/smc/smc_wr.c
> > @@ -38,7 +38,7 @@ static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
> >  static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
> > 
> >  struct smc_wr_tx_pend {	/* control data for a pending send request */
> > -	u64			wr_id;		/* work request id sent */
> > +	struct ib_cqe		cqe;
> >  	smc_wr_tx_handler	handler;
> >  	enum ib_wc_status	wc_status;	/* CQE status */
> >  	struct smc_link		*link;
> > @@ -63,62 +63,51 @@ void smc_wr_tx_wait_no_pending_sends(struct smc_link *link)
> >  	wait_event(link->wr_tx_wait, !smc_wr_is_tx_pend(link));
> >  }
> > 
> > -static inline int smc_wr_tx_find_pending_index(struct smc_link *link, u64 wr_id)
> > +static void smc_wr_tx_rdma_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
> >  {
> > -	u32 i;
> > +	struct smc_link *link = wc->qp->qp_context;
> > 
> > -	for (i = 0; i < link->wr_tx_cnt; i++) {
> > -		if (link->wr_tx_pends[i].wr_id == wr_id)
> > -			return i;
> > -	}
> > -	return link->wr_tx_cnt;
> > +	/* terminate link */
> > +	if (unlikely(wc->status))
> > +		smcr_link_down_cond_sched(link);
> > +}
> > +
> > +static void smc_wr_reg_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
> > +{
> > +	struct smc_link *link = wc->qp->qp_context;
> > +
> > +	if (wc->status)
> > +		link->wr_reg_state = FAILED;
> > +	else
> > +		link->wr_reg_state = CONFIRMED;
> > +	smc_wr_wakeup_reg_wait(link);
> >  }
> > 
> > -static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
> > +static void smc_wr_tx_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
> >  {
> > -	struct smc_wr_tx_pend pnd_snd;
> > +	struct smc_wr_tx_pend *tx_pend, pnd_snd;
> >  	struct smc_link *link;
> >  	u32 pnd_snd_idx;
> > 
> >  	link = wc->qp->qp_context;
> > 
> > -	if (wc->opcode == IB_WC_REG_MR) {
> > -		if (wc->status)
> > -			link->wr_reg_state = FAILED;
> > -		else
> > -			link->wr_reg_state = CONFIRMED;
> > -		smc_wr_wakeup_reg_wait(link);
> > -		return;
> > -	}
> > +	tx_pend = container_of(wc->wr_cqe, struct smc_wr_tx_pend, cqe);
> > +	pnd_snd_idx = tx_pend->idx;
> > +
> > +	tx_pend->wc_status = wc->status;
> > +	memcpy(&pnd_snd, tx_pend, sizeof(pnd_snd));
> > 
> > -	pnd_snd_idx = smc_wr_tx_find_pending_index(link, wc->wr_id);
> >  	if (pnd_snd_idx == link->wr_tx_cnt) {
> > -		if (link->lgr->smc_version != SMC_V2 ||
> > -		    link->wr_tx_v2_pend->wr_id != wc->wr_id)
> > -			return;
> > -		link->wr_tx_v2_pend->wc_status = wc->status;
> > -		memcpy(&pnd_snd, link->wr_tx_v2_pend, sizeof(pnd_snd));
> > -		/* clear the full struct smc_wr_tx_pend including .priv */
> > -		memset(link->wr_tx_v2_pend, 0,
> > -		       sizeof(*link->wr_tx_v2_pend));
> > -		memset(link->lgr->wr_tx_buf_v2, 0,
> > -		       sizeof(*link->lgr->wr_tx_buf_v2));
> > +		memset(link->lgr->wr_tx_buf_v2, 0, sizeof(*link->lgr->wr_tx_buf_v2));
> >  	} else {
> > -		link->wr_tx_pends[pnd_snd_idx].wc_status = wc->status;
> > -		if (link->wr_tx_pends[pnd_snd_idx].compl_requested)
> > +		if (tx_pend->compl_requested)
> >  			complete(&link->wr_tx_compl[pnd_snd_idx]);
> > -		memcpy(&pnd_snd, &link->wr_tx_pends[pnd_snd_idx],
> > -		       sizeof(pnd_snd));
> > -		/* clear the full struct smc_wr_tx_pend including .priv */
> > -		memset(&link->wr_tx_pends[pnd_snd_idx], 0,
> > -		       sizeof(link->wr_tx_pends[pnd_snd_idx]));
> > -		memset(&link->wr_tx_bufs[pnd_snd_idx], 0,
> > -		       sizeof(link->wr_tx_bufs[pnd_snd_idx]));
> > +		memset(&link->wr_tx_bufs[tx_pend->idx], 0, sizeof(link->wr_tx_bufs[tx_pend->idx]));
> >  		if (!test_and_clear_bit(pnd_snd_idx, link->wr_tx_mask))
> >  			return;
> >  	}
> > 
> > -	if (wc->status) {
> > +	if (unlikely(wc->status)) {
> >  		if (link->lgr->smc_version == SMC_V2) {
> >  			memset(link->wr_tx_v2_pend, 0,
> >  			       sizeof(*link->wr_tx_v2_pend));
> > @@ -128,44 +117,12 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
> >  		/* terminate link */
> >  		smcr_link_down_cond_sched(link);
> >  	}
> > +
> >  	if (pnd_snd.handler)
> >  		pnd_snd.handler(&pnd_snd.priv, link, wc->status);
> >  	wake_up(&link->wr_tx_wait);
> >  }
> > 
> > -static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
> > -{
> > -	struct smc_ib_device *dev = from_tasklet(dev, t, send_tasklet);
> > -	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
> > -	int i = 0, rc;
> > -	int polled = 0;
> > -
> > -again:
> > -	polled++;
> > -	do {
> > -		memset(&wc, 0, sizeof(wc));
> > -		rc = ib_poll_cq(dev->roce_cq_send, SMC_WR_MAX_POLL_CQE, wc);
> > -		if (polled == 1) {
> > -			ib_req_notify_cq(dev->roce_cq_send,
> > -					 IB_CQ_NEXT_COMP |
> > -					 IB_CQ_REPORT_MISSED_EVENTS);
> > -		}
> > -		if (!rc)
> > -			break;
> > -		for (i = 0; i < rc; i++)
> > -			smc_wr_tx_process_cqe(&wc[i]);
> > -	} while (rc > 0);
> > -	if (polled == 1)
> > -		goto again;
> > -}
> > -
> > -void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
> > -{
> > -	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
> > -
> > -	tasklet_schedule(&dev->send_tasklet);
> > -}
> > -
> >  /*---------------------------- request submission ---------------------------*/
> > 
> >  static inline int smc_wr_tx_get_free_slot_index(struct smc_link *link, u32 *idx)
> > @@ -202,7 +159,6 @@ int smc_wr_tx_get_free_slot(struct smc_link *link,
> >  	struct smc_wr_tx_pend *wr_pend;
> >  	u32 idx = link->wr_tx_cnt;
> >  	struct ib_send_wr *wr_ib;
> > -	u64 wr_id;
> >  	int rc;
> > 
> >  	*wr_buf = NULL;
> > @@ -226,14 +182,13 @@ int smc_wr_tx_get_free_slot(struct smc_link *link,
> >  		if (idx == link->wr_tx_cnt)
> >  			return -EPIPE;
> >  	}
> > -	wr_id = smc_wr_tx_get_next_wr_id(link);
> > +
> >  	wr_pend = &link->wr_tx_pends[idx];
> > -	wr_pend->wr_id = wr_id;
> >  	wr_pend->handler = handler;
> >  	wr_pend->link = link;
> >  	wr_pend->idx = idx;
> >  	wr_ib = &link->wr_tx_ibs[idx];
> > -	wr_ib->wr_id = wr_id;
> > +	wr_ib->wr_cqe = &wr_pend->cqe;
> >  	*wr_buf = &link->wr_tx_bufs[idx];
> >  	if (wr_rdma_buf)
> >  		*wr_rdma_buf = &link->wr_tx_rdmas[idx];
> > @@ -248,21 +203,18 @@ int smc_wr_tx_get_v2_slot(struct smc_link *link,
> >  {
> >  	struct smc_wr_tx_pend *wr_pend;
> >  	struct ib_send_wr *wr_ib;
> > -	u64 wr_id;
> > 
> >  	if (link->wr_tx_v2_pend->idx == link->wr_tx_cnt)
> >  		return -EBUSY;
> > 
> >  	*wr_buf = NULL;
> >  	*wr_pend_priv = NULL;
> > -	wr_id = smc_wr_tx_get_next_wr_id(link);
> >  	wr_pend = link->wr_tx_v2_pend;
> > -	wr_pend->wr_id = wr_id;
> >  	wr_pend->handler = handler;
> >  	wr_pend->link = link;
> >  	wr_pend->idx = link->wr_tx_cnt;
> >  	wr_ib = link->wr_tx_v2_ib;
> > -	wr_ib->wr_id = wr_id;
> > +	wr_ib->wr_cqe = &wr_pend->cqe;
> >  	*wr_buf = link->lgr->wr_tx_buf_v2;
> >  	*wr_pend_priv = &wr_pend->priv;
> >  	return 0;
> > @@ -306,8 +258,6 @@ int smc_wr_tx_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv)
> >  	struct smc_wr_tx_pend *pend;
> >  	int rc;
> > 
> > -	ib_req_notify_cq(link->smcibdev->roce_cq_send,
> > -			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> >  	pend = container_of(priv, struct smc_wr_tx_pend, priv);
> >  	rc = ib_post_send(link->roce_qp, &link->wr_tx_ibs[pend->idx], NULL);
> >  	if (rc) {
> > @@ -323,8 +273,6 @@ int smc_wr_tx_v2_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv,
> >  	int rc;
> > 
> >  	link->wr_tx_v2_ib->sg_list[0].length = len;
> > -	ib_req_notify_cq(link->smcibdev->roce_cq_send,
> > -			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> >  	rc = ib_post_send(link->roce_qp, link->wr_tx_v2_ib, NULL);
> >  	if (rc) {
> >  		smc_wr_tx_put_slot(link, priv);
> > @@ -367,10 +315,7 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr)
> >  {
> >  	int rc;
> > 
> > -	ib_req_notify_cq(link->smcibdev->roce_cq_send,
> > -			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> >  	link->wr_reg_state = POSTED;
> > -	link->wr_reg.wr.wr_id = (u64)(uintptr_t)mr;
> >  	link->wr_reg.mr = mr;
> >  	link->wr_reg.key = mr->rkey;
> >  	rc = ib_post_send(link->roce_qp, &link->wr_reg.wr, NULL);
> > @@ -431,94 +376,76 @@ static inline void smc_wr_rx_demultiplex(struct ib_wc *wc)
> >  {
> >  	struct smc_link *link = (struct smc_link *)wc->qp->qp_context;
> >  	struct smc_wr_rx_handler *handler;
> > +	struct smc_ib_recv_wr *recv_wr;
> >  	struct smc_wr_rx_hdr *wr_rx;
> > -	u64 temp_wr_id;
> > -	u32 index;
> > 
> >  	if (wc->byte_len < sizeof(*wr_rx))
> >  		return; /* short message */
> > -	temp_wr_id = wc->wr_id;
> > -	index = do_div(temp_wr_id, link->wr_rx_cnt);
> > -	wr_rx = (struct smc_wr_rx_hdr *)(link->wr_rx_bufs + index * link->wr_rx_buflen);
> > +
> > +	recv_wr = container_of(wc->wr_cqe, struct smc_ib_recv_wr, cqe);
> > +
> > +	wr_rx = (struct smc_wr_rx_hdr *)(link->wr_rx_bufs + recv_wr->index * link->wr_rx_buflen);
> >  	hash_for_each_possible(smc_wr_rx_hash, handler, list, wr_rx->type) {
> >  		if (handler->type == wr_rx->type)
> >  			handler->handler(wc, wr_rx);
> >  	}
> >  }
> > 
> > -static inline void smc_wr_rx_process_cqes(struct ib_wc wc[], int num)
> > +static void smc_wr_rx_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
> >  {
> > -	struct smc_link *link;
> > -	int i;
> > -
> > -	for (i = 0; i < num; i++) {
> > -		link = wc[i].qp->qp_context;
> > -		link->wr_rx_id_compl = wc[i].wr_id;
> > -		if (wc[i].status == IB_WC_SUCCESS) {
> > -			link->wr_rx_tstamp = jiffies;
> > -			smc_wr_rx_demultiplex(&wc[i]);
> > -			smc_wr_rx_post(link); /* refill WR RX */
> > -		} else {
> > -			/* handle status errors */
> > -			switch (wc[i].status) {
> > -			case IB_WC_RETRY_EXC_ERR:
> > -			case IB_WC_RNR_RETRY_EXC_ERR:
> > -			case IB_WC_WR_FLUSH_ERR:
> > -				smcr_link_down_cond_sched(link);
> > -				if (link->wr_rx_id_compl == link->wr_rx_id)
> > -					wake_up(&link->wr_rx_empty_wait);
> > -				break;
> > -			default:
> > -				smc_wr_rx_post(link); /* refill WR RX */
> > -				break;
> > -			}
> > +	struct smc_link *link = wc->qp->qp_context;
> > +
> > +	if (wc->status == IB_WC_SUCCESS) {
> > +		link->wr_rx_tstamp = jiffies;
> > +		smc_wr_rx_demultiplex(wc);
> > +		smc_wr_rx_post(link, wc->wr_cqe); /* refill WR RX */
> > +	} else {
> > +		/* handle status errors */
> > +		switch (wc->status) {
> > +		case IB_WC_RETRY_EXC_ERR:
> > +		case IB_WC_RNR_RETRY_EXC_ERR:
> > +		case IB_WC_WR_FLUSH_ERR:
> > +			smcr_link_down_cond_sched(link);
> > +			break;
> > +		default:
> > +			smc_wr_rx_post(link, wc->wr_cqe); /* refill WR RX */
> > +			break;
> >  		}
> >  	}
> >  }
> > 
> > -static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
> > +int smc_wr_rx_post_init(struct smc_link *link)
> >  {
> > -	struct smc_ib_device *dev = from_tasklet(dev, t, recv_tasklet);
> > -	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
> > -	int polled = 0;
> > -	int rc;
> > +	u32 i;
> > +	int rc = 0;
> > 
> > -again:
> > -	polled++;
> > -	do {
> > -		memset(&wc, 0, sizeof(wc));
> > -		rc = ib_poll_cq(dev->roce_cq_recv, SMC_WR_MAX_POLL_CQE, wc);
> > -		if (polled == 1) {
> > -			ib_req_notify_cq(dev->roce_cq_recv,
> > -					 IB_CQ_SOLICITED_MASK
> > -					 | IB_CQ_REPORT_MISSED_EVENTS);
> > -		}
> > -		if (!rc)
> > -			break;
> > -		smc_wr_rx_process_cqes(&wc[0], rc);
> > -	} while (rc > 0);
> > -	if (polled == 1)
> > -		goto again;
> > +	for (i = 0; i < link->wr_rx_cnt; i++)
> > +		rc = smc_wr_rx_post(link, &link->wr_rx_ibs[i].cqe);
> > +	return rc;
> >  }
> > 
> > -void smc_wr_rx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
> > -{
> > -	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
> > +/***************************** init, exit, misc ******************************/
> > +
> > 
> > -	tasklet_schedule(&dev->recv_tasklet);
> > +static inline void smc_wr_reg_init_cqe(struct ib_cqe *cqe)
> > +{
> > +	cqe->done = smc_wr_reg_process_cqe;
> >  }
> > 
> > -int smc_wr_rx_post_init(struct smc_link *link)
> > +static inline void smc_wr_tx_init_cqe(struct ib_cqe *cqe)
> >  {
> > -	u32 i;
> > -	int rc = 0;
> > +	cqe->done = smc_wr_tx_process_cqe;
> > +}
> > 
> > -	for (i = 0; i < link->wr_rx_cnt; i++)
> > -		rc = smc_wr_rx_post(link);
> > -	return rc;
> > +static inline void smc_wr_rx_init_cqe(struct ib_cqe *cqe)
> > +{
> > +	cqe->done = smc_wr_rx_process_cqe;
> >  }
> > 
> > -/***************************** init, exit, misc ******************************/
> > +static inline void smc_wr_tx_rdma_init_cqe(struct ib_cqe *cqe)
> > +{
> > +	cqe->done = smc_wr_tx_rdma_process_cqe;
> > +}
> > 
> >  void smc_wr_remember_qp_attr(struct smc_link *lnk)
> >  {
> > @@ -548,9 +475,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
> >  		    &init_attr);
> > 
> >  	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
> > -			       lnk->qp_attr.cap.max_send_wr);
> > +			       lnk->qp_attr.cap.max_send_wr - 1);
> >  	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
> > -			       lnk->qp_attr.cap.max_recv_wr);
> > +			       lnk->qp_attr.cap.max_recv_wr - 1);
> >  }
> > 
> >  static void smc_wr_init_sge(struct smc_link *lnk)
> > @@ -585,6 +512,8 @@ static void smc_wr_init_sge(struct smc_link *lnk)
> >  			lnk->wr_tx_rdma_sges[i].tx_rdma_sge[0].wr_tx_rdma_sge;
> >  		lnk->wr_tx_rdmas[i].wr_tx_rdma[1].wr.sg_list =
> >  			lnk->wr_tx_rdma_sges[i].tx_rdma_sge[1].wr_tx_rdma_sge;
> > +		lnk->wr_tx_rdmas[i].wr_tx_rdma[0].wr.wr_cqe = &lnk->tx_rdma_cqe;
> > +		lnk->wr_tx_rdmas[i].wr_tx_rdma[1].wr.wr_cqe = &lnk->tx_rdma_cqe;
> >  	}
> > 
> >  	if (lnk->lgr->smc_version == SMC_V2) {
> > @@ -622,10 +551,13 @@ static void smc_wr_init_sge(struct smc_link *lnk)
> >  			lnk->wr_rx_sges[x + 1].lkey =
> >  				lnk->roce_pd->local_dma_lkey;
> >  		}
> > -		lnk->wr_rx_ibs[i].next = NULL;
> > -		lnk->wr_rx_ibs[i].sg_list = &lnk->wr_rx_sges[x];
> > -		lnk->wr_rx_ibs[i].num_sge = lnk->wr_rx_sge_cnt;
> > +		lnk->wr_rx_ibs[i].wr.next = NULL;
> > +		lnk->wr_rx_ibs[i].wr.sg_list = &lnk->wr_rx_sges[x];
> > +		lnk->wr_rx_ibs[i].wr.num_sge = lnk->wr_rx_sge_cnt;
> >  	}
> > +
> > +	smc_wr_reg_init_cqe(&lnk->wr_reg_cqe);
> > +	lnk->wr_reg.wr.wr_cqe = &lnk->wr_reg_cqe;
> >  	lnk->wr_reg.wr.next = NULL;
> >  	lnk->wr_reg.wr.num_sge = 0;
> >  	lnk->wr_reg.wr.send_flags = IB_SEND_SIGNALED;
> > @@ -641,7 +573,6 @@ void smc_wr_free_link(struct smc_link *lnk)
> >  		return;
> >  	ibdev = lnk->smcibdev->ibdev;
> > 
> > -	smc_wr_drain_cq(lnk);
> >  	smc_wr_wakeup_reg_wait(lnk);
> >  	smc_wr_wakeup_tx_wait(lnk);
> > 
> > @@ -758,11 +689,19 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
> >  				   GFP_KERNEL);
> >  	if (!link->wr_rx_ibs)
> >  		goto no_mem_wr_tx_ibs;
> > +	/* init wr_rx_ibs cqe */
> 
> Should you do the same init for link->wr_tx_ibs also?
> 

Not in this version, because the send ib_cqe is embedded in
smc_wr_tx_pend. However, since I will introduce smc_ib_send_wr in the
next version, I will add the corresponding initialization for it here.

> > +	for (int i = 0; i < link->max_recv_wr; i++) {
> > +		smc_wr_rx_init_cqe(&link->wr_rx_ibs[i].cqe);
> > +		link->wr_rx_ibs[i].wr.wr_cqe = &link->wr_rx_ibs[i].cqe;
> > +		link->wr_rx_ibs[i].index = i;
> > +	}
> >  	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
> >  				    sizeof(link->wr_tx_rdmas[0]),
> >  				    GFP_KERNEL);
> >  	if (!link->wr_tx_rdmas)
> >  		goto no_mem_wr_rx_ibs;
> > +
> > +	smc_wr_tx_rdma_init_cqe(&link->tx_rdma_cqe);
> >  	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
> >  				    sizeof(link->wr_tx_rdma_sges[0]),
> >  				    GFP_KERNEL);
> > @@ -785,6 +724,8 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
> >  				    GFP_KERNEL);
> >  	if (!link->wr_tx_pends)
> >  		goto no_mem_wr_tx_mask;
> > +	for (int i = 0; i < link->max_send_wr; i++)
> > +		smc_wr_tx_init_cqe(&link->wr_tx_pends[i].cqe);
> >  	link->wr_tx_compl = kcalloc(link->max_send_wr,
> >  				    sizeof(link->wr_tx_compl[0]),
> >  				    GFP_KERNEL);
> > @@ -804,6 +745,7 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
> >  					 GFP_KERNEL);
> >  		if (!link->wr_tx_v2_pend)
> >  			goto no_mem_v2_sge;
> > +		smc_wr_tx_init_cqe(&link->wr_tx_v2_pend->cqe);
> >  	}
> >  	return 0;
> > 
> > @@ -837,18 +779,6 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
> >  	return -ENOMEM;
> >  }
> > 
> > -void smc_wr_remove_dev(struct smc_ib_device *smcibdev)
> > -{
> > -	tasklet_kill(&smcibdev->recv_tasklet);
> > -	tasklet_kill(&smcibdev->send_tasklet);
> > -}
> > -
> > -void smc_wr_add_dev(struct smc_ib_device *smcibdev)
> > -{
> > -	tasklet_setup(&smcibdev->recv_tasklet, smc_wr_rx_tasklet_fn);
> > -	tasklet_setup(&smcibdev->send_tasklet, smc_wr_tx_tasklet_fn);
> > -}
> > -
> >  static void smcr_wr_tx_refs_free(struct percpu_ref *ref)
> >  {
> >  	struct smc_link *lnk = container_of(ref, struct smc_link, wr_tx_refs);
> > @@ -868,8 +798,6 @@ int smc_wr_create_link(struct smc_link *lnk)
> >  	struct ib_device *ibdev = lnk->smcibdev->ibdev;
> >  	int rc = 0;
> > 
> > -	smc_wr_tx_set_wr_id(&lnk->wr_tx_id, 0);
> > -	lnk->wr_rx_id = 0;
> >  	lnk->wr_rx_dma_addr = ib_dma_map_single(
> >  		ibdev, lnk->wr_rx_bufs, lnk->wr_rx_buflen * lnk->wr_rx_cnt,
> >  		DMA_FROM_DEVICE);
> > @@ -917,7 +845,6 @@ int smc_wr_create_link(struct smc_link *lnk)
> >  	if (rc)
> >  		goto cancel_ref;
> >  	init_completion(&lnk->reg_ref_comp);
> > -	init_waitqueue_head(&lnk->wr_rx_empty_wait);
> >  	return rc;
> > 
> >  cancel_ref:
> > diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
> > index aa4533af9122..4268dfcd84d3 100644
> > --- a/net/smc/smc_wr.h
> > +++ b/net/smc/smc_wr.h
> > @@ -44,19 +44,6 @@ struct smc_wr_rx_handler {
> >  	u8			type;
> >  };
> > 
> > -/* Only used by RDMA write WRs.
> > - * All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly
> > - */
> > -static inline long smc_wr_tx_get_next_wr_id(struct smc_link *link)
> > -{
> > -	return atomic_long_inc_return(&link->wr_tx_id);
> > -}
> > -
> > -static inline void smc_wr_tx_set_wr_id(atomic_long_t *wr_tx_id, long val)
> > -{
> > -	atomic_long_set(wr_tx_id, val);
> > -}
> > -
> >  static inline bool smc_wr_tx_link_hold(struct smc_link *link)
> >  {
> >  	if (!smc_link_sendable(link))
> > @@ -70,9 +57,10 @@ static inline void smc_wr_tx_link_put(struct smc_link *link)
> >  	percpu_ref_put(&link->wr_tx_refs);
> >  }
> > 
> > -static inline void smc_wr_drain_cq(struct smc_link *lnk)
> > +static inline void smc_wr_drain_qp(struct smc_link *lnk)
> >  {
> > -	wait_event(lnk->wr_rx_empty_wait, lnk->wr_rx_id_compl == lnk->wr_rx_id);
> > +	if (lnk->qp_attr.cur_qp_state != IB_QPS_RESET)
> 
> Should you call *ib_drain_qp* only for the IB_QPS_RTS (Ready to Send) &
> IB_QPS_SQD (Send Queue Draining) qp states?
> 

No need. ib_drain_qp() handles state checking internally. The
IB_QPS_RESET check here is purely to avoid warning logs from the RDMA
core. Logically, omitting it wouldn't cause any issues at all.

> > +		ib_drain_qp(lnk->roce_qp);
> >  }
> 
> Did you enable dynamic_debug for drivers/infiniband/core/cq.c and SMC to
> confirm that the CQ polling paths are active and that there are no WARNs
> on teardown?
> 
> I followed the steps below, but I couldn't see the relevant logs in
> dmesg. Not sure if I missed something.
> 

I haven't tried dynamic_debug for this. Normally, I use BPF to confirm
that the CQ polling paths are active.

> # Mount debugfs if not already mounted
> mount -t debugfs none /sys/kernel/debug
> 
> # Enable debug for InfiniBand CQ handling (cq.c)
> echo "file drivers/infiniband/core/cq.c +p" >
> /sys/kernel/debug/dynamic_debug/control
> 
> # Enable debug for SMC (Sockets over RDMA)
> echo "module smc +p" > /sys/kernel/debug/dynamic_debug/control
> 
> # Increase loglevel to ensure debug messages show on console
> echo 8 > /proc/sys/kernel/printk
> 
> # Monitor dmesg for CQ related logs
> dmesg -w | grep -iE "cq|ib_core|smc"   ----> I didn't see relevant logs
> while I ran continuous SMC-R traffic
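For reference, the kind of BPF check I mean is a bpftrace one-liner
along these lines (assuming ib_poll_handler, the softirq poll entry in
drivers/infiniband/core/cq.c, shows up in kallsyms and is not inlined):

	# count CQ poll invocations per CPU while SMC-R traffic runs
	bpftrace -e 'kprobe:ib_poll_handler { @polls[cpu] = count(); }'

With the CQ pool in use, the counts should spread across several CPUs,
instead of everything funneling through the per-device tasklets as
before.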