Date: Fri, 13 May 2022 10:40:10 +0800
Subject: Re: [PATCH for-rc] RDMA/rxe: Fix rnr retry behavior
To: Bob Pearson, jgg@nvidia.com, zyjzyj2000@gmail.com, linux-rdma@vger.kernel.org
References: <20220512194901.76696-1-rpearsonhpe@gmail.com>
From: Yanjun Zhu
In-Reply-To: <20220512194901.76696-1-rpearsonhpe@gmail.com>
X-Mailing-List: linux-rdma@vger.kernel.org

On 2022/5/13 3:49, Bob Pearson wrote:
> Currently the completer tasklet when it sets the retransmit timer or the
> nak timer sets the same flag (qp->req.need_retry) so that if either
> timer fires it will attempt to perform a retry flow on the send queue.
> This has the effect of responding to an RNR NAK at the first retransmit
> timer event which does not allow for the requested rnr timeout.
>
> This patch adds a new flag (qp->req.need_rnr_timeout) which, if set,
> prevents a retry flow until the rnr nak timer fires.
>
> This patch fixes rnr retry errors which can be observed by running the
> pyverbs test suite 50-100X. With this patch applied they do not occur.

Do you mean that running run_tests.py 50-100 times reproduces this bug?
I would like to reproduce the problem.

Thanks a lot.
Zhu Yanjun

>
> Link: https://lore.kernel.org/linux-rdma/a8287823-1408-4273-bc22-99a0678db640@gmail.com/
> Fixes: 8700e3e7c485 ("Soft RoCE (RXE) - The software RoCE driver")
> Signed-off-by: Bob Pearson
> ---
>  drivers/infiniband/sw/rxe/rxe_comp.c  | 4 +---
>  drivers/infiniband/sw/rxe/rxe_qp.c    | 1 +
>  drivers/infiniband/sw/rxe/rxe_req.c   | 6 ++++--
>  drivers/infiniband/sw/rxe/rxe_verbs.h | 1 +
>  4 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index 138b3e7d3a5f..bc668cb211b1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -733,9 +733,7 @@ int rxe_completer(void *arg)
>  				if (qp->comp.rnr_retry != 7)
>  					qp->comp.rnr_retry--;
>
> -				qp->req.need_retry = 1;
> -				pr_debug("qp#%d set rnr nak timer\n",
> -					 qp_num(qp));
> +				qp->req.need_rnr_timeout = 1;
>  				mod_timer(&qp->rnr_nak_timer,
>  					  jiffies + rnrnak_jiffies(aeth_syn(pkt)
>  						& ~AETH_TYPE_MASK));
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 62acf890af6c..1c962468714e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -513,6 +513,7 @@ static void rxe_qp_reset(struct rxe_qp *qp)
>  	atomic_set(&qp->ssn, 0);
>  	qp->req.opcode = -1;
>  	qp->req.need_retry = 0;
> +	qp->req.need_rnr_timeout = 0;
>  	qp->req.noack_pkts = 0;
>  	qp->resp.msn = 0;
>  	qp->resp.opcode = -1;
> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
> index ae5fbc79dd5c..770ae4279f73 100644
> --- a/drivers/infiniband/sw/rxe/rxe_req.c
> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
> @@ -103,7 +103,8 @@ void rnr_nak_timer(struct timer_list *t)
>  {
>  	struct rxe_qp *qp = from_timer(qp, t, rnr_nak_timer);
>
> -	pr_debug("qp#%d rnr nak timer fired\n", qp_num(qp));
> +	qp->req.need_retry = 1;
> +	qp->req.need_rnr_timeout = 0;
>  	rxe_run_task(&qp->req.task, 1);
>  }
>
> @@ -624,10 +625,11 @@ int rxe_requester(void *arg)
>  		qp->req.need_rd_atomic = 0;
>  		qp->req.wait_psn = 0;
>  		qp->req.need_retry = 0;
> +		qp->req.need_rnr_timeout = 0;
>  		goto exit;
>  	}
>
> -	if (unlikely(qp->req.need_retry)) {
> +	if (unlikely(qp->req.need_retry && !qp->req.need_rnr_timeout)) {
>  		req_retry(qp);
>  		qp->req.need_retry = 0;
>  	}
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index e7eff1ca75e9..ab3186478c3f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -123,6 +123,7 @@ struct rxe_req_info {
>  	int need_rd_atomic;
>  	int wait_psn;
>  	int need_retry;
> +	int need_rnr_timeout;
>  	int noack_pkts;
>  	struct rxe_task task;
>  };
>
> base-commit: c5eb0a61238dd6faf37f58c9ce61c9980aaffd7a
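
For anyone trying to follow the flag interaction while reproducing this, the gating described in the commit message can be modelled in user space. This is only a rough sketch, not driver code: struct req_model, on_rnr_nak(), on_retransmit_path(), on_rnr_nak_timer(), requester_pass() and replay_rnr_sequence() are hypothetical names invented for illustration, and the retransmit path is simplified to "sets need_retry" as the commit message describes it. Only the two flag names and the patched condition are taken from the diff above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two flags in struct rxe_req_info. */
struct req_model {
	int need_retry;
	int need_rnr_timeout;
	int retries_run;	/* how many times the retry flow ran */
};

/* Completer received an RNR NAK: with the patch it only arms the RNR wait. */
static void on_rnr_nak(struct req_model *r)
{
	r->need_rnr_timeout = 1;
}

/* Simplified retransmit path (assumption): it requests a retry flow. */
static void on_retransmit_path(struct req_model *r)
{
	r->need_retry = 1;
}

/* RNR NAK timer fired: request a retry and lift the gate, as in the patch. */
static void on_rnr_nak_timer(struct req_model *r)
{
	r->need_retry = 1;
	r->need_rnr_timeout = 0;
}

/* One requester pass using the patched condition; true if a retry ran. */
static bool requester_pass(struct req_model *r)
{
	if (r->need_retry && !r->need_rnr_timeout) {
		r->retries_run++;
		r->need_retry = 0;
		return true;
	}
	return false;
}

/*
 * Replays RNR NAK -> retransmit path -> RNR NAK timer.
 * Bit 0 is set if a retry ran before the RNR NAK timer fired,
 * bit 1 is set if a retry ran after it fired.
 */
static int replay_rnr_sequence(void)
{
	struct req_model r = {0};
	int result = 0;

	on_rnr_nak(&r);
	on_retransmit_path(&r);
	if (requester_pass(&r))
		result |= 1;

	on_rnr_nak_timer(&r);
	if (requester_pass(&r))
		result |= 2;

	return result;
}
```

In this model the retry requested by the retransmit path stays pending while need_rnr_timeout is set, and only the pass after on_rnr_nak_timer() runs it, which is the behavior the commit message asks for. Without the extra flag, the first pass would already have retried.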